Willow Voice Assistant

Hey @kristiankielhofner, have you come across BrainChip? Setting aside cost and development time, their Akida chip looks well suited to your wake word use case.

Got everything set up and have played around with it a little. One of the things I think would be nice to see eventually is what Google at least calls ‘continued conversation’. So I can say “Alexa, turn on my basement lights.” (basement lights turn on) “Turn off the bathroom fan.” (bathroom fan turns off) “Thank you.” (‘you’re welcome’ message). It’s also nice for times when, with Google, I’ll set a reminder and then ask it to change the time or correct something it misheard.

1 Like

This would be awesome. One thing I often do with Google is ask for another light to be turned on/off after asking for one previously. Not having to trigger the wake word speeds things up and is more natural.

1 Like

Someone said, “Then all you do is flash and talk.” I’ve flashed it and it’s connected to HA. The wake word is recognized EVERY time, but NONE of the commands are understood. Clearly I’m missing something obvious in HA. Any ideas?

Silly question, but do you have the Assist voice assistant set up?

1 Like

Devices weren’t exposed to Assist. Thanks for the lead. It was an HA issue, not Willow. But has anyone figured out how to pass Willow voice to Spotify or Sonos? I’m looking for “Hi ESP, play <artist/song>” type functionality and, aside from specific commands to play specific playlists configured in YAML ahead of time, it’s just not happening for me.

Sorry I got caught up elsewhere and disappeared on all of you!

Generally speaking, “Hi ESP” is a terrible wake word. I myself (with a Walter Cronkite-esque “no accent” American “accent”) have to enunciate the letters. For users with significant issues we’ve had much better luck with Alexa (which isn’t great, obviously), but until we commission “Hi Willow”, something like it, or other community wake words, these are our options. We’ve also exposed many more underlying speech engine configuration options to better tune the sensitivity/aggressiveness of wake word detection, and we’ve seen much better results with these - to the point where we will be changing the defaults for Willow 1.0. Generally speaking, in terms of wake detection, VAD, VAD timeout, etc., our defaults in the early release are about as conservative as you can get.
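
For anyone wondering what the VAD timeout knob actually controls, here is a minimal conceptual sketch in Python (Willow itself does this in C on the ESP32-S3; the frame size, timeout value, and is_speech() callable below are illustrative placeholders, not Willow parameters): capture keeps going until the voice activity detector has reported silence for the configured duration in a row.

```python
FRAME_MS = 30          # audio processed in 30 ms frames (illustrative)
VAD_TIMEOUT_MS = 300   # shorter timeout = more "aggressive", ends capture sooner

def capture_utterance(frames, is_speech):
    """frames: iterable of audio frames; is_speech(frame) -> bool is the VAD."""
    utterance, trailing_silence_ms = [], 0
    for frame in frames:
        utterance.append(frame)
        if is_speech(frame):
            trailing_silence_ms = 0
        else:
            trailing_silence_ms += FRAME_MS
        if trailing_silence_ms >= VAD_TIMEOUT_MS:
            break   # enough trailing silence: treat the utterance as finished
    return utterance
```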

We have the option to use one, two, or three channel wake and AFE - which gives us the fundamental ability to process a single mic, each mic channel independently, or the reference channel. Down the road we can make better use of this.
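
As a rough illustration of what processing “each mic channel independently, as well as the reference channel” means, here is a tiny numpy sketch that deinterleaves a 3-channel capture into two mic channels plus the playback reference used for echo cancellation; the channel ordering is an assumption for illustration, not the BOX’s actual layout.

```python
import numpy as np

def split_channels(interleaved, n_channels=3):
    """interleaved: 1-D int16 array of interleaved samples, e.g. from an I2S DMA buffer."""
    frames = interleaved.reshape(-1, n_channels)
    mic_left, mic_right, playback_ref = frames[:, 0], frames[:, 1], frames[:, 2]
    # Each mic stream can be fed to wake detection on its own, while the
    # reference channel lets AEC subtract the device's own playback audio.
    return mic_left, mic_right, playback_ref
```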

In terms of quantization, as you note there are “high perf” (unquantized) and “low cost” (quantized) options, and from the esp-sr resource docs we feel we should have the headroom for the “high perf” variants - which we should be able to obtain when commissioning a custom wake word (or even from Espressif for existing wake words).

In terms of other approaches (tflite, etc.) my belief is that true wake word robustness (even for individual speakers) can only come from quality samples in the training process, and that is the real value of the commercial Espressif models and process (20k samples, 500 speakers - including 100 children - professional and controlled environments, strict specifications on recording parameters, tuning, selection, etc.). Many open source/self-trained wake word approaches have been (more or less) complete failures because they largely ignore these fundamentals (or require significant hardware resources). See the price point of the Mycroft devices for comparison. Mycroft is largely in the commercial state they are in because it turns out the total addressable market for a voice interface at a 10x price point (compared to Alexa, Google Home, etc.) is tiny (non-existent). Not to mention these commercial voice interfaces are largely sold as loss leaders with the expectation that they drive further revenue for the various ecosystems of the commercial providers, which artificially reduces market price points.

For 1.0 we are working on easy-to-flash binaries combined with completely dynamic configuration through a web interface. At this point users will be able to define arbitrary on-device multinet commands which (like Willow Inference Server mode) are passed directly to HA as text - if you can configure the intent to do something in HA, we’ll send it.
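
For readers curious what “passed directly to HA as text” looks like on the receiving end, here is a minimal sketch against Home Assistant’s REST conversation endpoint; the host and token are placeholders, and this illustrates the HA side of the handoff rather than Willow’s exact transport.

```python
import requests

HA_URL = "http://homeassistant.local:8123"      # placeholder host
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"          # placeholder credential

def send_transcript(text, language="en"):
    """Hand a speech-to-text transcript to HA's intent matching."""
    resp = requests.post(
        f"{HA_URL}/api/conversation/process",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"text": text, "language": language},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()   # includes the response HA would speak back

# send_transcript("turn on the basement lights")
```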

For on device commands only English is supported currently. However, when using Willow Inference Server (which you can self-host) we support all of the languages of Whisper. We currently have users using at least English, Spanish, Portuguese, French, Dutch, German, and Korean with WIS and Willow.

We’ve seen them and others. It’s problematic because of the price point I referenced earlier - the really compelling thing about leveraging the ESP BOX is that it’s (as we say) more-or-less $50, take it out of the box, flash, and “put on your kitchen counter”. That’s a “final product delivered to the end-user, with worldwide distribution, and international/national certification” price point that none of these other approaches will get even remotely close to (see my earlier comment about Mycroft).

Continued conversation and even “turn by turn” is something we are certainly investigating. Likely not for 1.0 but there aren’t any fundamental limitations that prevent implementing this.

We have at least one user I’m aware of that has reported this working on our GitHub (no idea how). Generally speaking, our role is actually pretty simple - wake and get clean speech-to-text to HA. As you saw with Assist, if HA can match it and do something, it’s transparent to us. We’re actually thinking about a community repo of expanded HA intent configurations for use with Assist, to make it easier to dramatically expand the capabilities of HA for commands.

Nope, the current multi-channel wake words from Espressif are WakeNet8 & WakeNet9:
https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/benchmark/README.html

My memory is terrible and the docs have been updated, but here is an older doc.

| Model Type | Parameter Num | RAM | Average Running Time per Frame | Frame Length |
|---|---|---|---|---|
| Quantised WakeNet3 | 26 K | 20 KB | 29 ms | 90 ms |
| Quantised WakeNet4 | 53 K | 22 KB | 48 ms | 90 ms |
| Quantised WakeNet5 | 41 K | 15 KB | 5.5 ms | 30 ms |
| Quantised WakeNet5X2 | 165 K | 20 KB | 10.5 ms | 30 ms |
| Quantised WakeNet5X3 | 371 K | 24 KB | 18 ms | 30 ms |
| Quantised WakeNet6 | 378 K | 45 KB | 4 ms (task 1) + 25 ms (task 2) | 30 ms |

Whilst the current resource occupancy figures are here:
https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/benchmark/README.html#resource-occupancyesp32-1

But all are quantised; in fact, you will not be running unquantised models on a microcontroller, even the mighty S3.

Mycroft is no more, and I can tell you about their wake word: users found the oddly named Precise wake word system not much better than the ESP32-S3-Box, because the training methods they used are as flawed as the Sonopy MFCC Python algorithm it uses.
I know exactly what was entailed in the Mycroft system, as I do in the ESP32-S3-Box now that I have refreshed my memory.
They are using the latest int16 & int8 quantised models of a multi-channel KWS, WakeNet8 or WakeNet9.

Also, your belief is wrong, as the samples really need to be of the hardware and environment of use, and the more the better; it’s a complete misnomer, as you want the signatures of the use environment, not a pro studio and equipment.
It’s true that most open source fails due to datasets, and it’s something I have been banging on about for some time, but the number of samples needed depends on use: for a global Big Data KW, 500 speakers (including 100 children) is likely inadequate, and even on my own KW training runs I often have 100K samples that have been augmented to create the numbers.
Big Data does actually send to the cloud, not for any clandestine monitoring but to create huge datasets of the hardware and environment of use.
For most environments there is usually only one or a few speakers, and for that use those are all the speakers you need; I have created excellent KWs of my voice that are useless for anyone else.
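
As a concrete example of what “augmented to create the numbers” can mean, a handful of personal keyword recordings can be expanded many times over with simple perturbations; this is a minimal numpy sketch with illustrative parameters, not any particular project’s training pipeline, and it assumes room_noise is at least as long as each sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(sample, room_noise, sr=16000):
    """sample, room_noise: 1-D float arrays of 16 kHz audio in [-1, 1]."""
    out = np.roll(sample, rng.integers(-sr // 10, sr // 10))    # +/- 100 ms shift
    out = out * rng.uniform(0.7, 1.3)                           # gain jitter
    start = rng.integers(0, max(1, len(room_noise) - len(out)))
    out = out + 0.1 * room_noise[start:start + len(out)]        # mix in room noise
    return np.clip(out, -1.0, 1.0)

# e.g. 1,000 recordings x 100 augmentations each = 100K training samples:
# dataset = [augment(kw, noise) for kw in keyword_clips for _ in range(100)]
```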

The actual paper Espressif references is https://arxiv.org/pdf/1609.03499.pdf, which is basically an early 2016 BC-ResNet, and it’s highly likely Espressif are using one of two frameworks to train their models, either TensorFlow or PyTorch, then quantising and exporting.
It’s highly likely the Espressif blob is tf4micro using a BC-ResNet (WakeNet8/9), as WakeNet5 is, from the look of it, a CRNN, which is a streaming model.
That likely makes sense from the number of params they have quoted.

Irrespective of the above, the ESP32-S3-Box is effectively already using the “high perf” variants; really, separate ones don’t exist, as that distinction is really about the ADF parts, and maybe the AEC-BSS-NS can be tweaked (to squeeze things in they could be running @LOW_COST), but you are already using the ‘best’ models and they don’t seem to be very good.

That is true to some extent, as Google makes a relatively small loss whilst Amazon leaks like a sieve.
The £400 Mycroft unit was £400 because it had zero engineering and was just a Pi 4 with an expensive beamforming board, an approach Google has dropped in favour of lower-cost 2-mic blind source separation, as the Espressif ADF also uses.
Economies of scale and the use of custom silicon greatly reduce cost, and probably with the economies of scale Google has, £89 is not that far off cost price to them.

The biggest problem with the likes of Mycroft is not applying some lateral thought and instead copying consumer models verbatim.
A multi-room setup needs only a single server-based brain, and each room just needs low-cost ears.
Standard ESP32-S3 boards such as the cute little low-cost T7 S3 from LILYGO are only $7.32, and I have seen similar clones as low as $5.

Trying to compete with consumer e-waste is likely futile, and really open source should be thinking out of the box to create something better.

I probably know a little more about KWS training and the models in use, and likely also what types of mics/ADCs to use.
I have been procrastinating for over a year about diving into C with the IDF & ADF, but that steep initial learning curve keeps putting me off.
My name backwards is rolyantrauts, and even though I was active with Rhasspy I think it has taken a massive wrong turn and also been totally done over by Raspberry Pi (stock).
So as well as dodging learning the Espressif IDF, I don’t think there is any good open source central voice assistant software; for the purpose, WeNet Community · GitHub is probably much better than Rhasspy, and Mycroft for me flirted dangerously near Kickstarter fraud and was never really any good.

So my lazy ass has done nothing: not only is an ESP32-S3 “ear” a steep learning curve, but creating a Linux-interoperable, inference-based skill system is also a whole load of other work, and I stopped short there.

I think with the ADF it’s very possible to create super-low-cost websocket KWS-based ears that act as distributed mics, where one or more can be situated in a room.
The fact that Espressif WakeNet is binary blobs with no open source you can hack into a custom solution is, for me, an absolute cul-de-sac, and as said, the demonstrator KWS models are extremely poor.
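
A minimal sketch of that “low-cost ear” idea, assuming a hypothetical central server that accepts raw PCM over a websocket; the URI, chunk size, and mic object are placeholders (a real ear would be the ESP32-S3 running C, this just shows the protocol shape):

```python
import asyncio
import websockets   # pip install websockets

SERVER_URI = "ws://192.168.1.10:8765/audio"   # hypothetical central "brain"
CHUNK_BYTES = 16000 * 2 // 10                 # 100 ms of 16 kHz 16-bit mono

async def stream_after_wake(mic):
    """Once the on-device KWS fires, forward ~3 s of audio to the server."""
    async with websockets.connect(SERVER_URI) as ws:
        for _ in range(30):                    # 30 x 100 ms chunks
            await ws.send(mic.read(CHUNK_BYTES))
        await ws.send(b"")                     # empty frame marks end of utterance

# asyncio.run(stream_after_wake(my_mic))       # mic object left abstract here
```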

I’m having pretty significant difficulty completely unpacking your reply but the general sense I get is that you’re right, we’re wrong, we’re going about this all wrong, and as a result our implementation is somewhere between “not very good” and “extremely poor”. A little background before I continue: if it may seem like I’m bothered or personally offended by any of this I absolutely am not. I’ve been on the internet since 1993 and started my first open source project in 2004. My skin is made of kevlar and I’m not bothered or offended in the slightest!

I genuinely and truly don’t intend to sound snarky and we do have the best of intentions: we don’t have any monetary gain from Espressif hardware or relationship of any kind with them. We have no intention of in any way monetizing the open source community - regardless of approach.

If you have or know of a superior approach or implementation(s) that actually exist, in the real world, with this functionality, as a finished product at this price point, that you can actually buy in reasonable quantity, that in the end actually delivers an overall superior user experience we would absolutely be interested in learning from it.

3 Likes

You can get I2S mics from AliExpress and elsewhere for a couple of dollars.
Same with I2S ADC modules; they are cheap.
I have seen clone S3 dev kits as low as $5.

Probably my end preference would be uni-directional electrets through a MAX9814 preamp, as it contains a great analogue AGC; combined with digital AGC it can give great far-field performance.
The design of a 2-mic board for any ESP32-S3 dev kit is fairly simple, and you could likely source the Everest ADCs that are used in the Box.
Likely the only thing missing is a cheap housing.

The rest of the ESP32-S3-Box is a great demonstrator but really not needed; the KWS models they demonstrate are not very good and are closed-source blobs, and any custom-trained one may be a different KW but will likely be the same quality.
If you jettisoned much of it, you could likely fit a bigger and better parameter model with the resources released, but these are just dev kits and components.

You may have a finished product with the ESP32-S3-Box/Lite, but for many of us who have tested those models the user experience is poor.


Actually, thinking about it, it’s likely that in the Box it’s the WakeNet8/9 BC-ResNet-type model, which is not streaming.
Whilst waiting for your reply I was trying to remember more from the experience, and I had a gut feeling that, with the tight resources, some KW misses were due to the rolling window of the audio input.

Maybe you could revert to KWS only and use the bigger Quantised WakeNet5X2 streaming model and that might give better results.

When you have a non-streaming model you can send 1 s chunks to the KWS, but often the KW might start 0.5 s in, get cut in two, and not be recognised.
So you create a rolling window of 2x 0.5 s chunks and feed the KWS the 1 s rolling windows, but at 2 Hz.
Even that can at times cut samples, so you create a 4x rolling window of 0.25 s chunks, but now you have 4x the KWS load.
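
A minimal sketch of the 2x 0.5 s rolling window described above, with a stand-in scoring function in place of the real (closed) model and an illustrative detection threshold:

```python
import numpy as np
from collections import deque

SAMPLE_RATE = 16000
HOP = SAMPLE_RATE // 2            # a new 0.5 s chunk arrives every 0.5 s
WINDOW_CHUNKS = 2                 # 2 x 0.5 s = 1 s window, scored at 2 Hz

ring = deque(maxlen=WINDOW_CHUNKS)

def kws_score(window):
    return 0.0                    # placeholder for the actual KWS model

def on_audio_chunk(chunk):
    """chunk: 0.5 s of 16 kHz samples."""
    ring.append(chunk)
    if len(ring) == WINDOW_CHUNKS:
        window = np.concatenate(ring)          # overlapping 1 s window
        if kws_score(window) > 0.9:
            print("wake word detected")
```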

Streaming uses a recurrent network with an LSTM (Long Short-Term Memory) layer that takes in 32 ms chunks but remembers the results of the other chunks in the overall 1 s window.
So maybe the WakeNet5X2 might give more satisfactory results, but it will require some jettisoning to get it to squeeze in.
I didn’t think ESP supported LSTM, but obviously it does.
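
For contrast, here is a minimal Keras sketch of the streaming idea: a stateful LSTM consumes one 32 ms frame at a time and carries its state across calls, so no rolling window is needed. The feature size, layer width, and sigmoid head are illustrative assumptions, not WakeNet’s actual architecture.

```python
import numpy as np
import tensorflow as tf

N_MFCC = 40   # assumed per-frame feature size for a 32 ms frame

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1, N_MFCC), batch_size=1),  # one frame per call
    tf.keras.layers.LSTM(64, stateful=True),          # state carries across calls
    tf.keras.layers.Dense(1, activation="sigmoid"),   # wake / not-wake score
])

def score_frame(features):
    """features: (N_MFCC,) array of features for one 32 ms frame."""
    x = features.reshape(1, 1, N_MFCC).astype("float32")
    return float(model(x, training=False))
```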

I’m guessing that this needs to be configured during the initial build steps and can’t be changed once a device is flashed, up, and running?

On that, to update the device do we have to run the whole build process from scratch?

Thank you for the pointers on hardware! However, this isn’t exactly groundbreaking or new knowledge to anyone who’s even attempted a rough DIY at this (Adafruit sells the max9814 - it’s not exactly obscure). The ESP BOX is sold as a “dev kit” for a reason - it’s a good (I’d argue better than good) effort at a reference design that can pass as a finished product at a compelling price point. What you have suggested here is a BOM, shopping list, assembly with a breadboard at best, and aesthetically a “jumble of wires” to most people. Far from “take it out of the box and put it on the kitchen counter” as I like to say. Speaking of which


Not exactly


Design - time, effort, and money
PCB - time, effort, and money
Housing - time, effort, and money
Assembly - time, effort, and money
Packaging - time, effort, and money
International Distribution (or shipping + customs) - time, effort, and money
Regulatory certification in various countries around the world (I doubt the markings on your S3 clone are legitimate) - time, effort, and money


and more, at unknown/uncommitted volumes. This is a very long way, in time, effort, and cost, from anything approaching an “average” user, in the real world, purchasing something in their country for roughly $50, flashing it, placing it in their environment, and making use of it. As I was once told: ideas and talk are cheap. Unless they are backed with effort, commitment, and (especially in this case, with hardware) likely fairly significant capital, they are essentially an irrelevant fairy tale. These issues (and more) all contribute to a very real fact: if it were as easy as you purport it to be, someone would have done it by now. From the sounds of it, maybe even you!

I know this is my project and like any “proud parent” my perspective is likely skewed but this conclusion simply does not generally align with what we are seeing and hearing from end users - even when factoring in the general negativity bias that you always experience with feedback. If you’re unfamiliar, very few satisfied people bother to give praise and the vast majority of any observable feedback is negative.

As best as we can tell, from the at least hundreds (likely thousands, paying attention to reseller stock and re-stock) of ESP BOX units sold around the world since the Willow release, we’re only familiar with a handful of reports of wake issues. Almost all of which we have been able to address by changing some of our original very conservative AFE parameters. Believe me - when people spend real money on real hardware for an express purpose and it doesn’t go well you will absolutely be hearing about it.

You only created the repo on 2023-03-31 (“created_at”: “2023-03-31T14:05:15Z”), so how many that is since release, who knows.

Guess we shall have to see what user feedback you get on the KWS; for me it’s too hit and miss, but that’s just my opinion.

I’m not sure how when the repo was created is relevant, but what I can tell you is that the very soft initial release (repo went public) and post to HN was 31 days ago. Within roughly 36 hours of that post the ESP BOX (and Lite - which we didn’t even officially support) was sold out worldwide (see many, many posts regarding hunts for stock). To my dismay it was even picked up by Ars Technica. From our first glances at stock from a few retailers prior to (soft) launch this was at least several hundred units. In the past few weeks Mouser, PiHut, and many other retailers have obtained much more stock and several have sold out again.

Espressif has clearly taken notice of the uptick in sales as they incorporated much of our hardware feedback (from our Wiki) into the next ESP BOX revision they tweeted about today.

Indeed! I appreciate the respectful exchange on this. In the end, this is an open source project with no monetization model for these users. Frankly if the experience and/or reaction was anything like what you are describing/predicting we would have likely significantly changed course or given up by now. I do open source projects for fun and to provide something of value. Not to make my life miserable.

When the ESP BOX first sold out, on our initial release, I was terrified. I honestly thought I’d post it to HN and maybe a dozen devs would take interest. As you note the repo was just over a month old at that point and we’d really only done testing with a few speakers in a few environments. With the level of response we received and our early state I was convinced I’d be waking up daily to floods of issues, reports, etc from angry users having horrible experiences and “Willow life” overall would be very painful.

More than a month later that still just hasn’t been the case. You’re clearly familiar with our repo - feel free to check through all of the issues, discussions, dig through social media, etc. What you’re describing just isn’t there.

I don’t need to be doing this, I’m not making any money from it, and I wouldn’t have any issue admitting to and recognizing (yet another) failure on my part - and there have been PLENTY.

It currently needs to be configured before build. We’re still a couple of weeks away from our 1.0 release with dynamic configuration for any number of devices, easy flashing, and OTA updates (preview).

That said for now you can just run ./utils.sh config, go to Willow Configuration, select Alexa under Wake Word, run ./utils.sh build and flash. You don’t need to run all of the steps again and it should only take a couple of minutes (tops).

1 Like

Everything is good with Willow!! It works fine for me! I hope TTS in French comes soon!
My dream is that all the processing is local, but sadly the inference server doesn’t work on Mac :wink: In my home, only macOS or Raspberry Pi :slight_smile: Thanks to the team for all the good work in such a short time. My Google Assistants are still useful, but not for much longer.
Thanks from a French user.

1 Like

Great to hear!

We will be completely revamping TTS support in a future release of Willow Inference Server. We’re currently preparing for our 1.0 release and unfortunately multiple language TTS won’t make the cut but it is a top priority for post 1.0.

We could potentially support Apple Neural acceleration for Macs but performance would be hit or miss - the entire ecosystem would need to implement it across the various frameworks we use and it’s just not there yet.

In terms of Raspberry Pi we have no plans to support it - the performance we require for speech to text, text to speech, etc is fundamentally impossible. See comparison benchmarks here. Barely realtime with tiny (poor quality) and even worse performance beyond is not something we will officially implement - it’s impossible to make it work with the quality and response time anyone would tolerate.

Willow Inference Server does work on x86_64 CPUs, and while the performance of WIS on CPU is best in class, it’s still a very rough user experience. Our default model of medium with beam size 1 takes 5.5 seconds to transcribe 3 seconds of speech on an AMD Ryzen 7 2700X. Even this CPU is completely unacceptable for our user experience standards.

I ran through this a couple of nights ago and all seemed well. However, despite Willow detecting the new wake word correctly and even printing my verbal request correctly on the screen, it then says ‘sorry I didn’t understand’.

Is there a way to confirm it is correctly communicating with Home Assistant?

I finally got the ESP32-S3-BOX hardware in, and after some wrestling w/ WSL2 and getting all the bits working under windows, I now have a functional, 100% local solution that shows hints of being able to replace Alexa.

Words cannot express how stoked I am about this, the entire package is shockingly well put together for a pre-1.0 release and I am deeply grateful for the work you and your team have done so far @kristiankielhofner .

Pay no mind to the haters, if they have something working better they can go ahead and release their own code (which we all know won’t happen). This is top shelf work and it’s fricken awesome that I can use it today, for free, and with amazing functionality already in the box.

7 Likes