Willow Voice Assistant

That feature will be introduced in v2023.7
https://www.reddit.com/r/homeassistant/comments/14m7pw0/ha_20237_is_introducing_service_calls_to_assist/

1 Like

Just giving this question a little bump


@kristiankielhofner any chance we can get a Docker Compose file for the Willow Inference Server?

My plan is to run it through Docker so I don't have to allocate my entire GPU to the Home Assistant VM

So excited for this project!

EDIT: you can ignore my question, I just read the GitHub, which states you guys will provide ready-to-deploy Docker images :slight_smile: can't wait!

1 Like

Sorry, we’re pretty busy and I don’t get around to responding here as often as I would like.

“Sorry I didn’t understand” comes from Home Assistant so Willow is communicating with HA correctly. However, you’re not matching an intent.

HA has pretty limited processing abilities with intents. You need to make sure your speech command (and the resulting transcript) exactly matches what you have defined in HA. You can use the assistant in the web UI to confirm which commands work (match) and which don’t. Many people find they need to significantly expand the built-in intents, make sure entities are exposed to Assist, and often add aliases for them to match their intended grammar.

The HA team, as part of the Year of Voice, is making rapid progress on the ability to match commands. We'll also be adding support for a natural language understanding/processing engine with intent recognition to pre-process commands before they are provided to Home Assistant, so the speech-to-text output doesn't need to perfectly match what is defined in HA.
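To illustrate the pre-processing idea (a rough sketch of the concept only, not Willow's implementation; the host, token, and command list below are placeholders), you could snap a noisy transcript to the closest known phrasing before handing it to HA's conversation API:

# Rough sketch of transcript pre-processing -- not Willow's actual code.
import difflib
import requests

HA_URL = "http://homeassistant.local:8123"   # assumed HA host
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"    # placeholder

# Phrasings Home Assistant is known to match (grows as you add aliases/intents)
KNOWN_COMMANDS = [
    "turn on the christmas lights",
    "turn off the christmas lights",
    "set the living room heater to 68 degrees",
]

def normalize(transcript: str) -> str:
    # Snap a noisy transcript to the closest known command, if one is close enough.
    match = difflib.get_close_matches(transcript.lower(), KNOWN_COMMANDS, n=1, cutoff=0.75)
    return match[0] if match else transcript

def send_to_assist(transcript: str) -> dict:
    # Forward the (normalized) command to Home Assistant's conversation API.
    resp = requests.post(
        f"{HA_URL}/api/conversation/process",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={"text": normalize(transcript), "language": "en"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

print(send_to_assist("turn on the christmas light"))  # small STT error gets corrected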

8 Likes

How goes the development effort? Inquiring minds would appreciate an update. Thanks in advance!!

2 Likes

He just posted an update on YouTube yesterday, some pretty cool stuff: https://www.youtube.com/watch?v=qlhSEeWJ4gs

3 Likes

It is that video, which I just noticed on the platform formerly known as Twitter, that brought me here in search of more information. I am old and the thought of talking to my appliances has never really pushed any buttons for me (and my S.O. is flat out opposed to Google/Alexa). But this is local and looks super interesting.

1 Like

Anyone have a step-by-step guide to set up an ESP32-S3-BOX using Windows? Also, are there any advantages to the newer model (ESP32-S3-BOX-3)?

Thanks

Hi, have you checked there: https://esphome.io/

https://www.espressif.com/sites/default/files/tools/flash_download_tool_3.9.5.zip
I think that's what I used, though it's been an age since I did it.
Apart from new connectors and a dock, the BOX-3 has 8 MB added to PSRAM.

I do have an ESP32-S3-BOX, installed using Windows, though unfortunately not with a full setup guide. As I recall, I found some reasonable guides to install WSL2 and Docker, then installed the Willow Application Server as documented. At that point I could reach the WAS application from the host machine (http://172.17.39.174:8502; IP address from the WSL command line: ip route get 1.1.1.1 | grep -oP 'src \K\S+' ). To make this available to the rest of the local network (and in particular the ESP32 box!), I needed to forward the port - from an admin command prompt:
netsh interface portproxy add v4tov4 listenport=8502 listenaddress=0.0.0.0 connectport=8502 connectaddress=172.17.39.174
I also had to open the port in the Windows firewall.
I haven't tried a local inference server yet (mostly because my graphics card is AMD rather than NVIDIA)

2 Likes

All, I just finished creating a full dashboard for our new Willow WIS servers. It's near real-time and uses very little in the way of resources. It's fun to watch what happens on that server when sending STT through it.

I’m sure there are other objects that can be monitored, but this is a good start.

2 Likes

Thanks Jeff. I just got my WIS up and running. I’ll be playing with this now!

I think all this stuff couldn't come at a better time. Amazon just laid off hundreds of folks from the Alexa voice division. I also noticed recently that all my Alexa devices “lock up” randomly: give a command and the light spins for 30 seconds with no response. At first I thought it was just one, but they all do it. And it's not network/internet related. Frankly, I won't shed any tears when I finally unplug all my Alexas.

Yeah. I would not allow any voice devices that went to the ‘cloud’ in my house. Luckily, the wife agreed. The kid was less than happy.

But, now that he is in his own house (with his web voice crap) and notices that ads from Google tend to match discussions they have in their house, he is finally getting it.

2 Likes

That’s downright creepy. I refuse to use Google products. Pure Evil has met its match in that company.

3 Likes

BTW, one of the community members, @kovram, forked the Willow Auto Correct main branch and added support for sending unrecognized commands off to Alexa. So, after renaming your Alexa to Echo:

“Alexa, turn on christmas lights” - HA turns on the lights
“Alexa, set living room heater to 68F” - HA sets the thermostat
“Alexa, set a timer named Bread Rising for 2 hours” - HA has no clue how to do this, so it gets forwarded to Alexa, which sets a timer for 2 hrs
“Alexa, how many ounces in a pound?” - again, HA has no clue, so Alexa answers
“Alexa, play rock & roll radio on Pandora” - HA has no easy way to do this (no actual way on HA Supervised) on Pandora or Apple Music, so this gets sent to Amazon to play music

What I like (love) about this solution is that Amazon no longer has ANY contact with my home devices and has no idea what I'm doing inside my home, unless knowing what my named timers are doing counts. And as a bonus, the Amazon Alexa can be MUTED, since the commands are sent to it programmatically. No more eavesdropping by nosy engineers at Amazon. And as time goes on, fewer and fewer commands will be forwarded to Amazon as more intents are created and refined in HA.
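Conceptually the fallback is simple. Here is a hypothetical sketch only (not the fork's actual code; forward_to_alexa() is a stand-in for whatever Alexa bridge you use): if HA's conversation API reports that nothing matched, the raw transcript is handed off to Alexa instead.

# Hypothetical sketch of the HA-first, Alexa-fallback idea -- not the fork's code.
import requests

HA_URL = "http://homeassistant.local:8123"   # assumed HA host
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"    # placeholder

def forward_to_alexa(text: str) -> None:
    # Stand-in for whatever Alexa bridge/remote-control mechanism you use.
    print(f"Forwarding to Alexa: {text}")

def handle(transcript: str) -> None:
    resp = requests.post(
        f"{HA_URL}/api/conversation/process",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={"text": transcript, "language": "en"},
        timeout=10,
    ).json()
    # HA answers unmatched commands with an error-type response
    if resp.get("response", {}).get("response_type") == "error":
        forward_to_alexa(transcript)

handle("set a timer named Bread Rising for 2 hours")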

That Auto Correct w/ Alexa forwarding fork is here:

It's better than a proof of concept; it actually works really, really well. But it is NOT something you should attempt if you aren't at least moderately comfortable with Docker, Linux, etc.

1 Like

Would it be technically possible for Willow to be built as an HAOS add-on in the future, for those of us less technically inclined with Docker etc.?

It's likely this will happen at some point, but that's just a guess on my part. If it were done, it would only let you use their hosted, best-effort STT engine, not a local one. Better than nothing, though.
For usable local STT, you need a PC with a GPU. An RPi is never (ever) going to give you local STT that is usable; it's too compute-intensive for a Pi, or any other CPU for that matter.

1 Like

Actually that is totally untrue, in fact massively untrue, especially with the Cortex-A76 RPi5, whose MatMul vector instructions give an ML boost of over 6x a Pi4.
Also, the STT engines we have available and running are not the greatest in quality or efficiency, and STT is constantly evolving in all aspects.

EfficientSpeech: An On-Device Text to Speech Model
EfficientSpeech, or ES for short, is an efficient neural text to speech (TTS) model. It generates mel spectrogram at a speed of 104 (mRTF) or 104 secs of speech per sec on an RPi4. Its tiny version has a footprint of just 266k parameters - about 1% only of modern day TTS such as MixerTTS. Generating 6 secs of speech consumes 90 MFLOPS only.

That is on a Pi4, and there are certainly models that, without a GPU, can produce great real-time voice on a Pi5.
However, I don't really advocate an RPi5, as it's far from efficient when RK3588(S) boards of similar price, such as the OrangePi 5, produce nearly 2x the GFLOPS/watt and are more efficient than current Apple silicon. GitHub - geerlingguy/top500-benchmark: Automated Top500 benchmark for clusters or single nodes.

Initial attempts often made streaming versions of ASR & TTS to reduce latency, but this is problematic for multi-user systems, as queues can quickly build up.
You can take a low-cost RK3588(S), with much better idle wattage than an RPi5, and employ models that are much faster than real time.
Race-till-idle produces minimal latency and can serve multiple users, especially since clashes are minimised by diversity of use.
You can scale by simply adding another metal instance fed from a routing queue: if the first instance is busy, the next text is routed to a second instance, roughly as in the sketch below.
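As a rough illustration only (the instance URLs and the /api/tts endpoint are hypothetical placeholders, not any particular server's API), a shared job queue with one worker per inference box gives you exactly that behaviour: a busy first box simply means the next job lands on the second.

# Hypothetical sketch of a routing queue: one shared queue of text jobs,
# one worker thread per inference box. Endpoints and hosts are placeholders.
import queue
import threading
import requests

INSTANCES = ["http://10.0.0.10:19000", "http://10.0.0.11:19000"]  # assumed boxes
jobs = queue.Queue()

def worker(base_url: str) -> None:
    while True:
        text = jobs.get()  # blocks until a job arrives
        try:
            # Hypothetical endpoint -- adjust to whatever your server exposes.
            requests.post(f"{base_url}/api/tts", json={"text": text}, timeout=30)
        except requests.RequestException as err:
            print(f"{base_url} failed: {err}")
        finally:
            jobs.task_done()

for url in INSTANCES:
    threading.Thread(target=worker, args=(url,), daemon=True).start()

jobs.put("The bread has finished rising.")  # served by whichever box is free first
jobs.join()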

Everything is currently available, from hardware to open-source software, but what we have available and running today is likely what creates your incorrect assumptions.

The evolution of the RK3588(S) has probably slowed due to the CHIPS Act, as China seems to be switching to RISC-V, and long term that is probably going to hurt Arm more than China.
Everyone in development, from Big Data to China, is creating on-device ML-capable silicon, but the worry is that the likes of Google are designing closed-source models and accompanying application-specific (ASIC) ML silicon that is such a massive leap over current open source in terms of efficiency, speed and quality that it could be very hard for general-purpose hardware and open source to compete.