Willow Voice Assistant

Everything is currently available, from hardware to open-source software, but what we have available and running is likely what creates your incorrect assumptions.

Could be, but GPUs are much more suited to this task today. Looking around at the massive number of users here and on Discord who are hopelessly frustrated with their streaming nodes and faster-whisper setups, Willow is a solution that works today: it is incredibly accurate, employs autocorrect, supports HA natively, passes unrecognized commands to Alexa, and is cheap and easy to implement from both a hardware and a software standpoint.

As a huge fan of HA, I sincerely hope the HA-native solutions mature to the point of being usable like Willow is today, but the vast majority of Pis and NUCs in use today simply won't cut it if folks expect local voice to work as a viable and reliable replacement for Alexa or Google devices. I'm more than happy to be proven incorrect, but my career in electronics and audio processing indicates otherwise.

As I've repeatedly stated, folks should look at the current HA voice stack as a fun proof of concept, not something they can deploy for everyday use. Unfortunately, most of the slick demo videos and such don't make that clear, and frustration ensues. In fact, in the 8 years or so that I've been using HA, I've never seen so much frustration expressed by users trying to get something billed as a new feature actually working. Myself included, and I'm an advanced user. Of course I expect all of this to be streamlined and to improve over time. History shows us this is the case more often than not, especially with the massive interest in voice as a core function of HA.

So, respectfully, I offer a look at, and comments on, an alternative solution that more and more folks are finding quite usable, and one they can experiment with today that gives excellent results.

Jeff
Edit: In my previous post, I should have stated that "the current crop of widely deployed RPis" won't be good for local STT. The hamster wheel of hardware upgrades will undoubtedly change that in the future.

Totally agree with you. I have been a bit dismayed at the shiny videos and claims that are causing users to go on a hardware spending trail, only to find things are much less effective than presented and could be considered a waste of money.

Here I totally disagree: with technologies that are going through unprecedented evolution, it's really short-sighted to embed anything native or specific.
If you take the initial topic of TTS, basically all that is needed is to drop a JSON/YAML-formatted file of MyZone/MyText where inotify can provide efficient notification.
We don't need an HA-native TTS container; what we need is a Linux common voice container that can easily house any of the vast range of existing and new developments in TTS.
They all have near enough the same criteria, text in and audio out, so why that needs anything HA-specific is purely open source becoming more closed…
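As a rough illustration of that drop-a-file idea, here is a minimal Python sketch. The /tmp/tts-queue directory, the {"zone": ..., "text": ...} payload keys and the speak() hook are all assumptions for illustration; it uses the watchdog library, which rides on inotify on Linux, and any TTS engine could sit behind it.

```python
# Sketch only: watch a directory for dropped JSON "zone + text" files and hand
# them to a TTS hook. WATCH_DIR, the payload keys and speak() are assumptions.
import json
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

WATCH_DIR = "/tmp/tts-queue"  # hypothetical drop directory

def speak(zone: str, text: str) -> None:
    # Placeholder: pass the text to whatever TTS engine the container runs.
    print(f"TTS for zone '{zone}': {text}")

class TTSDropHandler(FileSystemEventHandler):
    def on_created(self, event):
        # inotify (via watchdog) tells us a new file appeared; ignore non-JSON.
        if event.is_directory or not event.src_path.endswith(".json"):
            return
        with open(event.src_path) as f:
            msg = json.load(f)
        speak(msg.get("zone", "default"), msg.get("text", ""))

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(TTSDropHandler(), WATCH_DIR, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```

Any automation that can write a small JSON file into that directory could then trigger speech, without anything HA-specific in the container itself.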

Whisper is a huge LLM-based ASR that uses context to gain accuracy, and with short command sentences there is no advantage.
In fact, the WER of any Whisper model rockets as soon as you start using anything but the few common languages in which it excels with large context. Analyzing Open AI's Whisper ASR Accuracy: Word Error Rates Across Languages and Model Sizes | Speechly

Again, this is what we have been offered, but that doesn't mean better and more applicable solutions haven't been available for a number of years; they already exist as GitHub repos.

You rave about a system that goes from a microcontroller-based ESP32-S3-BOX to an x86 GPU-based system, purely because of the software they have used, which was easy to appropriate, refactor and rebrand as GitHub - toverainc/willow-inference-server: Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS

ASR is the same as TTS, and there should be no need for any branding apart from Linux speech containers, as ASR is merely TTS in inverse, where audio in is converted to text out.
The containers and the queue-and-route system do not need to be branded or native, apart from being a Linux system where any existing or emerging model can be inserted with minimal code.

Every time a project appropriates, refactors and rebrands existing code into the community of a smaller herd, we lose so many benefits, and currently it's rife everywhere, to the detriment of a Linux voice framework, which is simply a series of containers with a network-layer transport and a routing queue (see the sketch below). That is all that is needed, but nearly every dev has appropriated, refactored and rebranded to keep the 'I' in their own.
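To show what I mean by an unbranded speech container, here is a bare-bones sketch of the sort of plain contract I have in mind: audio in, text out for ASR, and text in, audio out for TTS, over plain HTTP. The /asr and /tts paths, the port and the transcribe()/synthesize() stubs are all made up for illustration; any existing or future model could be dropped behind them.

```python
# Sketch of an engine-agnostic speech container: plain HTTP, no branding.
# Endpoints, port and the model hooks below are illustrative assumptions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def transcribe(audio: bytes) -> str:
    # Placeholder hook: hand raw audio to whatever ASR model is installed.
    return "turn on the kitchen lights"

def synthesize(text: str) -> bytes:
    # Placeholder hook: hand text to whatever TTS model is installed.
    return b"RIFF"  # stand-in for real WAV bytes

class SpeechHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        if self.path == "/asr":
            # Audio in -> JSON text out.
            payload = json.dumps({"text": transcribe(body)}).encode()
            content_type, status = "application/json", 200
        elif self.path == "/tts":
            # JSON text in -> audio out.
            payload = synthesize(json.loads(body)["text"])
            content_type, status = "audio/wav", 200
        else:
            payload, content_type, status = b"", "text/plain", 404
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), SpeechHandler).serve_forever()
```

Whether the model behind transcribe()/synthesize() is Whisper, wav2vec2, WeNet, Piper or anything newer would then be an implementation detail of the container, not something HA or any other consumer needs to know about.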

No disrespect intended, but quite frankly I (and I'm sure 99% of all HA users) couldn't care less if a solution is refactored, forked, or whatever. If it works and works well, is easy to implement and cost-effective for the average user, integrates into our existing HA ecosystem, and is supported by intelligent folks who care enough to continually improve the product, THAT is a win for everyone. Arguing against this basic premise is pedantic. Most of us "Joe six-pack users" just want shit that works and works well. After all, it is the groundswell of this type of average-user adoption that has propelled HA to the throne it currently enjoys.

ESPHome is a perfect example of this. Its adoption has absolutely skyrocketed. Sure, you can try to implement ESP stuff solely in Arduino and C++ (which each have their place), but the rest of us "simple guys" prefer its ease of configuration AND its usability.

So, want a voice solution that is best in class today for HA and just plain works, without the frustration of the current HA stack? Try Willow.

As always, your mileage will vary considerably.


Yeah and when

This is another annoying thing, as that is just fan-based hyperbole; when it comes to recognition accuracy, the likes of Google/Amazon still rule the roost among the HA integrations on offer, but they are not offline.
It's really strange and sad that so many are waving versions of technology like football scarves and just cheering for their own team, when the claims are far from true.

It would have been really great if Willow could have stayed focused on using the ESP32-S3-BOX as a KWS device and maybe set a standard for how array-microphone KWS should have been part of a Linux voice system as a new type of device.
They could even have expanded on that with a lower-cost ESP32-S3 open-source hardware design with an I2S ADC and mics, and created an in-house KWS.

But anyway, enjoy yourself,

As an old guy, I've always wanted to use this perfect response to posts like this… but I had to wait until now.

MEH…

Yeah, I agree: until we manage to pool into a bigger herd and community to share the load and expertise, things will remain very meh, with big data ruling the voice domain.
Unless some big funded benefactor such as Mozilla or the Linux Foundation can dictate initial standards and solutions for Linux speech frameworks, the gap between what we have and the commercial offerings will keep getting bigger.

Yes. The Willow Application Server has already been ported to an add-on: https://github.com/nwithan8/hassio-addons/tree/master/willow-application-server

The autocorrect and inference servers have not yet been ported.


@nwithan8
Interesting, ok, that's cool. I tried adding it as a repository, but it gave me an error. Is there a trick to it?
I used https://github.com/nwithan8/hassio-addons/tree/master/willow-application-server as the repository URL.

Cmd('git') failed due to: exit code(128)
cmdline: git clone -v --recursive --depth=1 --shallow-submodules -- https://github.com/nwithan8/hassio-addons/tree/master /data/addons/git/e9f4a7cf
stderr: 'Cloning into '/data/addons/git/e9f4a7cf'...
fatal: repository 'https://github.com/nwithan8/hassio-addons/tree/master/' not found'

You would just add the whole repository:

https://github.com/nwithan8/hassio-addons

Out of curiosity, has anyone tried using Willow with Onju Voice?

It uses an ESP32-S3, so I'm guessing it should be easy to get it working.

Also, does anyone have instructions for the inference server specifically for XTTS? I'm trying to deploy it, but I'm having issues finding the appropriate values for all of the environment variables.

This is a good question to pose on their Discord channel.

Just did a quick search on their Discord, and it seems like @kristiankielhofner wants to focus support on the ESP32-S3-BOX hardware for now :man_shrugging:

I think the firmware build is open source?

Definitely a few here have used Onju Voice. I would post here to see if you can find what you need. There is a voice-assistance channel where it has been mentioned.


Is there a way to use Willow in the HA voice assistant pipeline, so I could, for example, use Extended OpenAI Conversation as the LLM?

In other words, Willow handles the wake word, speech-to-text and text-to-speech. I've found openWakeWord, Whisper and Piper to be pretty unusable.

You probably want to have a look at Willow: https://heywillow.io/
GitHub - toverainc/willow: Open source, local, and self-hosted Amazon Echo/Google Home competitive Voice Assistant alternative

The box itself does have a ChatGPT example: esp-box/examples/chatgpt_demo at master · espressif/esp-box · GitHub

To say these are Google/Amazon competitive is extremely optimistic…

Check out my cookbook on how to integrate Willow with Home Assistant:
How To install voice components (notion.site)


That's a really good guide +1
As for Wyoming, it's a TWAIN of sorts, and it's really strange given the number of high-performance C libs we have, but hey, all hail Python, even though the ALSA Python wrappers are not great.

It's really strange that Whisper is used, as it's an LLM, and the more context you give it the more accurate it is likely to get.
Combine an LLM form of ASR with the tiny models many are running, and the WER (Word Error Rate), especially with short sentences, is pretty terrible, even compared to older Wav2Vec and much smaller ASR models.
There is a difference between a conversational/translation ASR and what you would likely use for a command-sentence ASR, and that seems to be ignored for dev convenience.

There are likely better options, but I often post WeNet as their docs go on to explain why:
LM for WeNet — wenet documentation. There are also n-gram transformers, but with commands you only need a limited dictionary, and the lighter, older tech, as they say, does work, and better than the current approach.
It also has relatively easy training frameworks, whilst even fine-tuning Whisper is no small matter.
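As a toy illustration of why a limited dictionary helps with command sentences (this is neither WeNet nor an n-gram LM, just a sketch with a made-up command list and cutoff), you can snap a noisy transcript onto the nearest known command instead of trusting the raw ASR output:

```python
# Sketch: constrain free-form ASR output to a small, known command set.
# The command list and the 0.6 cutoff are arbitrary choices for illustration.
import difflib

COMMANDS = [
    "turn on the kitchen lights",
    "turn off the kitchen lights",
    "set the living room temperature to 21 degrees",
    "lock the front door",
]

def snap_to_command(transcript, cutoff=0.6):
    """Return the closest known command, or None if nothing is close enough."""
    matches = difflib.get_close_matches(transcript.lower(), COMMANDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A mis-heard transcript still resolves to the intended command.
print(snap_to_command("turn on the kitten lights"))  # -> "turn on the kitchen lights"
```

A proper n-gram or WeNet-style LM does this at decode time rather than after the fact, but the principle is the same: with commands the search space is tiny, so lighter and older tech can do the job.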

Still, on the input audio processing side, what is available as open source trails far behind and doesn't do on-device training to tweak model weights through use.

I never really got the ESP32-S3-BOX for real use: it's a great technology-demonstrator dev kit, but at £50 it's not that great, even with optimism.

Likely models will filter down, but there is quite a time lag between SotA and open source, more years than months.

There is also tech on the marketplace that solely does the missing or poor initial audio processing; it's expensive, but for the price of two S3-Boxes you can get something that works.

It works in a similar way to Google's on-device algorithms but is single-user, profiled by a recorded voiceprint: it extracts a known voiceprint rather than trying to cancel unknown noise. It doesn't seem to create any audio artefacts that stop it working with Whisper.

Awesome, thank you! That answers my question. I wasn't sure how Willow and HA connected as far as the voice pipeline goes.

Has anyone had experience successfully using TTS in a non-English language? I discovered an issue where, when selecting any language other than English, the TTS component tries to read words from another language as if they were in English. My TTS settings look like this: https://infer.tovera.io/api/tts?force_language=ru