Some good news for you - we have done terrible and awful things to ESP Boxes in the course of development for Willow! We have yet to brick a single device (and the team has at least 15 of them). They are EXTREMELY resilient to flash failures, etc.
The Mycroft story is a frustrating one. Over the course of my 20+ year career in this field I tend to see the same mistakes being made over and over again. I say mistakes because I’ve made all of them and more myself!
They took on WAY too many hard things WAY too early (and all at once), and the story and the final result are a reflection of that. I’ve seen these movies before and they never end well.
The ESP BOX always uses local processing for wake word and AFE/DSP. However, speech recognition runs in one of two user-configurable modes:
- Local. When local command recognition is selected, we pull the friendly names of entities from Home Assistant and dynamically build the grammar required by the on-device Multinet 6 speech recognition module. It currently supports a maximum of 400 commands; in terms of the hardware, model, DSP, etc. this isn’t necessarily a hard limit, but we enforce it strictly for now because that is what we have tested with. In this mode speech/audio never leaves the ESP BOX itself.
- Willow Inference Server. Next week we will be releasing our highly optimized WIS implementation so users can self-host. This is what powers the best-effort Tovera-hosted speech recognition server we provide by default. In this mode, as soon as wake is activated we begin to stream audio (after DSP processing) directly to WIS in real time. When voice activity detection detects end of speech we send a small end marker, and WIS takes the buffered audio, performs speech recognition, and sends the results back to the device (a rough sketch of this flow follows this list). This enables extremely low latency, high performance, and highly accurate recognition of any speech across more than 40 languages. We provide the detected language to Home Assistant, so you can walk up to Willow, speak in any of these 40 languages (without extra configuration or prior knowledge), and it will send the output to Home Assistant with the detected ISO language code, complete with UTF-8 encoding of the various character sets.
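Roughly, the client side of that flow looks like the sketch below. To be clear, the endpoint path, end marker, and response format here are illustrative placeholders for the sake of explanation, not the actual Willow/WIS wire protocol:

```python
# Illustrative sketch of the server-mode flow described above. The endpoint,
# end marker, and response format are placeholders, not the real WIS protocol.
import json

import websockets  # pip install websockets


async def stream_utterance(frames, uri="wss://wis.example.com/api/stream"):
    """Stream DSP-processed PCM frames after wake, then signal end of speech."""
    async with websockets.connect(uri) as ws:
        for frame in frames:           # 16-bit mono PCM chunks from the AFE/DSP
            await ws.send(frame)       # audio streams in real time as it is captured
        await ws.send(b"__END__")      # small end-of-speech marker once VAD fires
        result = json.loads(await ws.recv())
        return result                  # e.g. {"text": "...", "language": "de"}

# import asyncio; asyncio.run(stream_utterance(captured_frames))
```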
In both cases the speech transcript is sent to the Home Assistant pipeline or conversation API over WebSockets or HTTP REST (with or without TLS), depending on the version of Home Assistant, component, transport, etc. we detect.
So, with Willow and WIS you can say things like “Put an entry on my calendar for lunch with Josh at 2pm on Wednesday May 22nd at Napoli’s Pizza in Chicago, Illinois” and as long as your HA intents can process it you’re good to go.
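For reference, the plain HTTP REST variant of that handoff is essentially a POST of the transcript (plus detected language) to Home Assistant’s conversation API; the host, token, and response handling below are placeholders:

```python
# Hand a finished transcript to Home Assistant's conversation REST API.
# Host and long-lived access token are placeholders.
import requests

HA_URL = "http://homeassistant.local:8123"   # placeholder
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"       # placeholder

def send_to_ha(text: str, language: str = "en") -> dict:
    resp = requests.post(
        f"{HA_URL}/api/conversation/process",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"text": text, "language": language},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# send_to_ha("Put an entry on my calendar for lunch with Josh at 2pm "
#            "on Wednesday May 22nd at Napoli's Pizza in Chicago, Illinois")
```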
Oh wow this got way more action than what I expected!!
Very interesting! I’ll definitely be checking out Willow!
Thanks for the info - the mix of local DSP sound processing for wake word and clarity, backed by a much more powerful STT and intents engine makes a lot of sense. It reduces the cost of the front end devices, and allows many to connect back to one central server resource.
Personally, my preference is for a local-only architecture without the need for cloud services - but obviously that means accepting the need to self-host. What will be interesting is whether the “back end” can be optimised down to something like an i5-class commodity desktop or micro-server, rather than a full i7-class gaming rig with a high-power GPU (don’t know Team Red as well - sorry).
The cloud service voice platforms must really have to optimise their pipelines to remove latency, as just moving packets up and down across the WAN to and from the cloud must add many tens of ms, putting that architecture behind self-hosting.
Local mode is all local on the ESP BOX itself.
Server mode uses our Willow Inference Server, and we will be releasing it next week. We didn’t want to release both simultaneously because we are a small team, the response from Willow alone has been overwhelming, and we’re struggling to keep up with the incoming interest as it is.
The Willow Inference Server is for self-hosting in server mode and you can put it anywhere. There is one “gotcha” of sorts. Our goal is to be the best voice user interface in the world and beat commercial offerings in every way possible. To do very high quality speech recognition with the sub 1s latency we target today that means GPU. I can assure you Amazon isn’t using CPUs for Alexa!
As an example, the most highly optimized CPU-only Whisper implementation is whisper.cpp. You can run it on the fastest CPU on the market and a $100, six-year-old GTX 1060 or Tesla P4 beats the pants off it - at a fraction of the cost AND power. GPUs are fundamentally different in architecture and significantly better suited to tasks like speech recognition.
The Willow Inference Server can run CPU-only, but for Alexa-quality speech recognition and user experience you will be waiting a long time for text output, and the benefits of a locally hosted high quality voice interface diminish considerably when you’re waiting three, five, or even ten seconds or more for a response. You could try the lighter models we offer (base); the quality will be significantly lower, although it may work just fine for your purposes. It will still be very “slow”.
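If you want to see the CPU-versus-GPU gap on your own hardware, a generic Whisper implementation such as faster-whisper makes for an easy experiment. To be clear, this is just an illustrative benchmark sketch, not WIS itself:

```python
# Generic Whisper timing experiment with faster-whisper (CTranslate2).
# Swap device="cuda" for device="cpu" to see the difference for yourself.
import time

from faster_whisper import WhisperModel  # pip install faster-whisper

model = WhisperModel("medium", device="cuda", compute_type="int8")

start = time.monotonic()
segments, info = model.transcribe("sample.wav", beam_size=1)
text = " ".join(s.text for s in segments)  # segments is a generator; consuming it runs inference
elapsed = time.monotonic() - start

print(f"{info.duration:.1f}s of {info.language} audio transcribed in {elapsed:.2f}s: {text}")
```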
Here are some early benchmarks for the (highly optimized) Willow Inference Server across various GPUs:
Device | Model | Beam Size | Speech Duration (ms) | Inference Time (ms) | Realtime Multiple |
---|---|---|---|---|---|
RTX 4090 | large-v2 | 5 | 3840 | 140 | 27x |
H100 | large-v2 | 5 | 3840 | 294 | 12x |
H100 | large-v2 | 5 | 10688 | 519 | 20x |
H100 | large-v2 | 5 | 29248 | 1223 | 23x |
GTX 1060 | large-v2 | 5 | 3840 | 1114 | 3x |
Tesla P4 | large-v2 | 5 | 3840 | 1099 | 3x |
RTX 4090 | medium | 1 | 3840 | 84 | 45x |
GTX 1060 | medium | 1 | 3840 | 588 | 6x |
Tesla P4 | medium | 1 | 3840 | 586 | 6x |
RTX 4090 | medium | 1 | 29248 | 377 | 77x |
GTX 1060 | medium | 1 | 29248 | 1612 | 18x |
Tesla P4 | medium | 1 | 29248 | 1730 | 16x |
RTX 4090 | base | 1 | 180000 | 277 | 648x (not a typo) |
So you can see from this a Tesla P4 can do 3.8 seconds of speech with the > 40 language Whisper medium model in 586ms - whether self-hosted locally or over the internet that easily meets our < 1s latency target.
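If it helps to read the table, the “Realtime Multiple” column is just the speech duration divided by the inference time, which is also the quick way to sanity-check a device against our latency target:

```python
# How the table's Realtime Multiple is derived, using the Tesla P4 medium row.
speech_ms = 3840       # 3.84 s of speech
inference_ms = 586     # Whisper medium, beam size 1
print(f"{speech_ms / inference_ms:.1f}x realtime")   # ~6.6x, shown as 6x in the table
print("meets < 1s target:", inference_ms < 1000)     # True
```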
Why not TPU? I got the dual core coral m2 thing and currently only using one core for frigate so would love to have something that I can use that second core for.
Devices like the coral m2 are interesting, and I’m not necessarily opposed to supporting them eventually, but it’s important to understand just how computationally demanding extremely high quality speech recognition with sub-one-second response times is.
We have heavily optimized our inference server implementation and quantized the models to 8-bit. The more than six-year-old Tesla P4 can do 22 TOPS in int8 (an RTX 4090 is 145 TOPS). The coral m2 specs claim a maximum of 4 TOPS, but the models and ecosystem are nowhere near as optimized or efficient as they are in something like CUDA land, so a true apples-to-apples comparison is practically impossible. Even so, I’d be surprised if the coral m2 came out to half of what they claim in the real world for this application.
Long story short, we’re not focusing on devices like the coral m2 because they fundamentally can’t provide the kind of user experience we’re designing for.
Ah ok that’s fair enough.
Is there a resource for checking TOPS? I have a jetson that’s collecting dust and wouldn’t mind giving it a job.
Nano?
Sadly they make the coral m2 look great. They’re rated at 472 GFLOPS, which is… not good.
HAHA yes, boo well that’s no fun.
I played with it for about a month and retired it pretty quickly.
I still have my Nano but it’s been collecting dust for a while
I was able to use Porcupine to detect a wake word and send the audio to a voice assistant pipeline on a raspberry pi 4 with a USB mic.
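In case anyone wants to reproduce that, a minimal Porcupine wake word loop looks roughly like this (the access key is a placeholder and the hand-off to the assistant pipeline is left as a stub):

```python
# Minimal Porcupine wake word loop for a Pi 4 with a USB mic.
# Access key is a placeholder; forwarding audio to the pipeline is left as a stub.
import pvporcupine
from pvrecorder import PvRecorder  # pip install pvporcupine pvrecorder

porcupine = pvporcupine.create(
    access_key="YOUR_PICOVOICE_ACCESS_KEY",  # placeholder
    keywords=["porcupine"],                  # or keyword_paths=[...] for a custom word
)
recorder = PvRecorder(frame_length=porcupine.frame_length, device_index=-1)
recorder.start()

try:
    while True:
        pcm = recorder.read()                # one frame of 16 kHz, 16-bit audio
        if porcupine.process(pcm) >= 0:      # returns keyword index, or -1 for no match
            print("wake word detected")
            # ...record the utterance here and forward it to the assistant pipeline
finally:
    recorder.stop()
    recorder.delete()
    porcupine.delete()
```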
Bit off topic, but I saw this TinyML project on an Arduino Nano that listens for dog barks, which is pretty cool.
Hi, I haven’t tried the project in this link but from the YouTube video it seems to work very well. It would be great if we could integrate this into ESPHome somehow. The link below is a demo project of a DIY Alexa.
For someone who wants to implement something: TensorFlow Micro has a keyword spotting example that works very well. Espressif’s audio framework (ESP-ADF) can also do that, although the ADF isn’t open source.
Regarding the resources needed: some people say it’s barely doable, but I tested TensorFlow Micro and I think it uses only a fraction of the ESP32’s computing power, let alone the second CPU core. It basically consists of a small neural network, and I believe it’s precise enough. (However, I haven’t had a decent microphone to do a proper test.)
Congratulations. After physically downsizing ALL the computers in my home, I’m buying a full-size motherboard, processor, case, and power supply for the first time in years. Not complaining, just noting the significance.
We’ll likely test a tf-micro implementation for wake word. The ESP BOX uses the ESP32-S3, which is significantly more capable than the original ESP32, so we’ll have even more headroom.
Glad to hear!
We’re well aware that our emphasis on a GPU and/or higher-performance CPU for inference is troubling to many users.
However, we’ve noticed that many people (it sounds like you’re one of them) are using self-hosted Willow as an opportunity to re-assess their hardware configurations. Once “bitten by the bug” that is HA, Frigate, Plex, and who knows what else, a lot of people end up with a less-than-ideal random collection of Raspberry Pis, NUCs, etc. they’ve assembled over the years. Combining them on a single (larger) machine almost always results in a significantly better experience all around while using equivalent power (or less), at an effectively lower cost assuming reasonable resale value and/or productive use of the replaced hardware for other tasks.
Recently I have begun testing the ODROID N2+ and saw the boards contain the Mali-G52 GPU.
Looking at your list of GPUs they appear to all be NVIDIA. Any plans to support others?
I expect performance to be lower but better than CPU alone.