Willow Voice Assistant

Interested to hear your feedback on the hardware!

A couple points:

The factory demo from Espressif (even the latest in their GitHub) uses a significantly out-of-date wake word and local speech recognition engine and model. We did quite a bit of work to bring the latest and greatest to Willow. If you haven’t done microcontroller programming in C before, NOTHING is easy and tasks like that are borderline monumental. But it’s worth it - the latest model we’ve incorporated is SIGNIFICANTLY better.

We have a hardware guide that covers this in more detail:

My two biggest issues are:

  1. The 3D-printed enclosure. The plastic is soft and retention on the screws isn’t great. That’s why you found the screw rattling in the box. Now that ESP Boxes have sold out around the world and Espressif is seeing real sales volume for the first time, we anticipate they will move to real injection-molded plastic that addresses the translucency and screw retention issues.

  2. That damn power LED. Ohhhh man, I hate that thing. One would think it would be controlled via GPIO but they (for some reason) connected it directly to a 3.3v buck converter coming off the input voltage… With the translucency of the case and the LED running at full duty, the green power LED is bright enough to see from space, and it makes the enclosure look even cheaper. Willow inits the LCD ASAP in the boot process so the user gets nearly immediate feedback that it’s on, so we don’t need it. I’ve taken to slightly opening the enclosures and snapping the power LED off the PCB. You can also touch the display to wake it up at any time, and the hardware microphone mute button on the top of the enclosure works with a status LED as well.

When you configure Willow you enter your WiFi credentials, Home Assistant server address, and personal access token. So now you are on WiFi and talking to Home Assistant.

Then all you do is flash and talk.
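
If you haven’t touched ESP-IDF before, the underlying flow is roughly the standard one below. This is just a rough sketch from inside a checkout of the Willow repo, not the exact supported commands - the README wraps these steps and is the authoritative reference, and the exact menuconfig entries may be named differently:

idf.py set-target esp32s3   # the ESP BOX is an ESP32-S3
idf.py menuconfig           # WiFi credentials, HA server/token, wake word selection
idf.py build                # compile the firmware
idf.py flash monitor        # flash over USB and watch the boot log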

Let me know how it goes, happy to help if you run into any snags! The only issue is I keep getting throttled by the forum and I’m currently limited to one post per hour. I just enabled Discussions on the Willow GitHub repo as I’m having many of these conversations across the internet, and it would be much more helpful for everyone to move users who have hardware and are getting started with Willow there:

2 Likes

Am I the only one who smells a snake oil salesman? Granted, it’s a weird thing to pitch your boat on, but every response is just a sales pitch for something that’ll be released ‘later’; no reply on any media/website has sounded genuine & is just basically ‘my team is the best…’

E.g.
If someone asked you whether the sky was blue or pink, your response would seemingly be ‘Well, my team has worked hard on the problem, we’ve used NASA’s framework, upgraded it to the latest version, tested it and removed the bugs, added a lot of our own code & aim to be the best colour detector out there. You can ignore your eyes, we’ll be better than them. Just ask us and we’ll know’… aka you haven’t answered the basic blue-or-pink question

It can be pink at sunset :stuck_out_tongue_winking_eye:

I mean, let’s wait and see. A lot of other people have gone and bought one, and you will get to hear their testimonies.

No real reason to be the sceptic before that happens.

1 Like

Or to be septic!

1 Like

If there was any sign of commercial interest maybe, but I just read enthusiasm and dedication. Thanks @kristiankielhofner / team tovera for your effort. I second the feature request for a “buy me a coffee” link.

5 Likes

I ordered one from Pi-Hut as well. Now all sold out :frowning:

1 Like

Willow is a month old and no one outside of my team knew it existed before Monday. As they say, “Rome wasn’t built in a day”.

If you were to go back 10 years and look at the first month of Home Assistant (or any other very early project) you would see the exact same types of responses. I know this because I’ve been in open source for over 25 years and I installed the first release of Home Assistant 10 years ago. Needless to say Home Assistant in 2013 wasn’t what it is today…

I’m genuinely curious about your perspective here - what are we “selling”, exactly? We have nothing to do with the manufacture, distribution, or sales of the ESP Box - we don’t make a penny from them. We have no way to accept donations. If I’m a “snake oil salesman” I’m really bad at actually profiting from it :slight_smile: .

9 Likes

This is a very good point!

I think what people are kind of missing here is we’re not seeing much online about Willow because it uses hardware that was obscure until Monday, and people have to buy it. Then it needs to be shipped, etc, and this imposes a delay compared to something you can install on a Linux box, Raspberry Pi, etc and post a video of online in 10 minutes.

Not everyone everywhere has five hour Amazon delivery. If you actually have an ESP BOX you ordered on Monday in your hands today you probably sprung for some very fast shipping! Many people ordered them from Ali and that’s a two week delivery at best…

That said, here’s an issue and demo video from a user in the UK from yesterday:

Many more to come!

Hi, I’m the user who posted that video on GitHub showing it controlling my office lights. Thought I’d add some context, since not many people have got the hardware yet.

I ordered an ESP Box from Pi-Hut in the UK on Monday when I heard about Willow, and received it 2 days later on Wednesday.

I was expecting it to be a lot of hassle to get working, but was pleasantly surprised - the installation instructions in the readme all worked first time. I’ve programmed a bit of microcontroller stuff, Python, C, used Docker etc… before, and am prepared to get my hands dirty, but none of that knowledge was really necessary.

The wake word, which I changed from “Hi ESP” to “Alexa” as part of the config before building, works well. Much better than any of the “plug a mic into a Raspberry Pi” type of projects I’ve tried before.

I’ve not yet moved the device from my office to the kitchen, which is where I currently use a Google Home Mini, so I’ve not tested it in a noisy environment or larger space yet, but I’m liking what I’m seeing so far.

As for the speech recognition (the bit after the wake word detection) – I tried the on-device MultiNet stuff, and it’s impressive for such a low-power device, but it’s never going to be as good as shipping the audio to whisper.cpp [Edit: not whisper.cpp per se, see reply below]

When configured to use the inference server – running Whisper on some graphics card somewhere hosted by the Willow authors, I presume – the accuracy is excellent. You can say anything and see the transcription. If it matches something HA understands, it works to control things just fine.

Here’s another demo, see the video description on youtube for more details:

I am excited and optimistic there might finally be a path to decent voice control stuff I can run locally (looking forward to running the inference server myself soon) :muscle:

2 Likes

Hey RJ! One slight tweak so there’s no confusion. We don’t use whisper.cpp; we use a highly optimized Whisper implementation based on ctranslate2 with our own additional performance optimizations.

Everyone can look for the release of the Whisper Inference Server next week to host locally! One caveat, though. Our goal is to best Alexa in every way possible. The fastest Whisper CPU implementation in the world (whisper.cpp) running on the fastest CPU on the market gets bested handily by a $100 used GTX 1060 that’s actually lower power (or, even better, a Tesla P4 that’s single slot, passively cooled, and uses PCIe slot power only with a max of 60 watts). We’re targeting sub-one-second response times all-in and currently GPU is the only way to do that.

The Willow Inference Server can run on CPU, but GPUs are just so fundamentally better suited to tasks like speech recognition that CPU performance and response times are pretty frustrating by comparison.

1 Like

Thanks for the correction @kristiankielhofner - even better news I guess :slight_smile: Looking forward to trying it out!

1 Like

Er… Yes?

The hardware (from which Willow make nothing, apparently) has surpassed expectations by quite a long way - compared with my Rhasspy and ReSpeaker 4 Mic Array, for example, no contest.

The only stumbling block I have is “all you do is flash and talk”, never having done any flashing before. Can somebody point me to an idiot’s guide?

Coming back to this, I’ll explain the monetization strategy for Willow.

I’m a serial startup founder and after a couple of successful exits I needed to take some time off and spend my time working on things that I think are interesting, like Willow.

But since then I’ve also either consulted with or advised everything from startups to the Fortune 500, and I can think of at least three specific projects/conversations/etc. where enterprise/commercial users are looking for a “de-Amazoned Alexa”. For example, I worked on a project with a large healthcare company that wanted to introduce voice into patient care settings. Everyone knew from the start that NO ONE, and I mean absolutely NO ONE, would be OK with walking into a hospital room and seeing an Amazon Echo there. Additionally, for obvious reasons healthcare is extremely concerned about protected patient information, control over where data goes, etc. They’re not touching Alexa or anything like it with a 10-foot pole.

Since starting my first open source project in 2004 (which became my first startup) I’ve always believed in open source, the advantages it provides, and how you can actually come up with successful open source business models that are not in conflict with your open source users.

You can probably see where this is going by now… Willow and the Willow Inference Server are open source and always will be. Frankly, I have no interest in monetizing the open source community user base; my monetization focus is on what I’m good at - enterprise.

Of course, the benefit here is that for this strategy to work we need to make Willow truly best of breed: Alexa with no compromises and beyond. Enterprise pays for it, the open source community benefits from it.

6 Likes

You inspired me to make my own demo!

I get into more technical detail here and give a preview of Willow and the Willow Inference Server (releasing next week) in action:

4 Likes

Got everything working using a fresh build on a Raspberry Pi. Took a couple of attempts to get all the dependencies. The following are the minimum steps required after a fresh install of raspios-bullseye-arm64-lite.

sudo apt update
sudo apt upgrade
sudo apt-get install python3-venv      # needed for the Python virtual environment
curl -sSL https://get.docker.com | sh  # install Docker via the convenience script
sudo usermod -aG docker $USER          # let the current user run docker without sudo
logout

Log back in and follow the steps carefully in the GitHub repo. Adding the current user to the docker group ensures you will not need to prefix the commands with sudo.

I needed to add the “intent:” key to my configuration.yaml to get HA to action the commands.
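
For anyone wondering what that looks like, as far as I understand it’s just an empty top-level key in configuration.yaml (minimal example - restart Home Assistant after adding it):

intent: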

Very impressed once I got to grips with the pronunciation of ESP - you need to emphasise the letters; my Brummie accent struggled at first. Very quick, very accurate.

For an alpha release this is very encouraging. It has a touch screen, buttons, GPIO and looks like a commercial product. Looking forward to the ride :slight_smile:

After a few small hiccups (which likely would have been avoided if I was better with Linux CLI commands) I got my ESP Box flashed. I will note that the repo has been updated since then to include prompts that help get past the issues I had.

What I love about it is that it works with HA instantly, and voice commands change things in HA, such as turning on a light, crazy fast.

I will say that so far I’ve had a few troubles with it not understanding my voice, but that is mostly solved by speaking a bit slower (it’s mostly the ‘Hi ESP’ wake word). This could just be due to my Aussie accent vs it likely being trained on a US accent. I hope this can be improved though because it will get pretty annoying… fast. Google Homes understand me perfectly.

1 Like

Agree with the wake word - speaking slowly helps.

I was once told by some Americans, my accent sounded Australian :slight_smile:

I believe they will be holding a poll to add wake words and “Hi Willow” will be the default. Don’t suppose “Yow Allrite Kevin” will make it

2 Likes

I have both the esp32-s3-box & esp32-s3-box-lite, which are both amazing demonstrators of what you can do on ‘just’ a microcontroller.

To be honest though, it’s not very good, because really it’s a demonstrator and Espressif have crammed everything bar the kitchen sink into the code on there.

Unlike Rhasspy it does have initial audio processing, as part of the Espressif ADF (Audio Front-end Framework - ESP32-S3 — ESP-SR latest documentation)

They are using BSS (Blind Source Separation), as does Google’s VoiceFilter-Lite, but unlike Google’s targeted system the order of the separated outputs is random (I presume), and it’s quite likely you’re not running just 1 KWS but 2 (1 on each channel, to select the channel, maybe).
It’s a great bit of lateral thinking by Espressif, but the KWS is a very lite model that’s been absolutely quantised to death to squeeze it into what the S3 can possibly load.

I have a Brit accent and the ‘Hi ESP’ just isn’t very consistent or, as said above, very good.

If you jettison much of that and just run the ADF with a KWS then it could likely be much better. I have a hunch the KWS model is very similar to, if not the same as, a BC-ResNet as mentioned in google-research/kws_streaming at master · google-research/google-research · GitHub - that will give you the documentation.
It was actually Snips who originally used a ResNet model; there is a paper somewhere, quite a few years old now.

You could just create your own KWS with a custom keyword with tf4micro, which likely would be much better as you gain access to much more of the available resources. It’s also a shame Espressif has released everything as closed-source blobs.
Because they are blobs you might be limited in how much you can expand on the ‘high_perf’ versions mentioned in the resource-consumption docs.

https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/benchmark/README.html#resource-occupancyesp32-1

[EDIT] I found the Snips paper https://arxiv.org/pdf/1811.07684.pdf

1 Like

Is it possible to recognize and use mapped custom commands in HA instead of “turn off x” with the on-device speech recognition? Like “feed Whiskers” or whatever?

I know speech recognition only supports English at the moment, but if the above is possible, would I be able to map words in a clever way so that it could recognize foreign language commands at this stage?

I would like to try this but, living in a non-English-speaking country, the rest of the family wouldn’t feel comfortable speaking English to the house.