Hello everyone,
I am thrilled to learn about the ESP32-S3-BOX. I am currently conducting some experiments with an ESP32 I2S microphone and amplifier. I’ve encountered a few issues, but after numerous tests, I’ve determined that the bottleneck lies in the Whisper/faster-Whisper component.
I’ve read several other posts, but they have left me somewhat confused.
At present, I am using a NUC equipped with an Intel N5105 and sufficient memory to run any container. However, it isn't fast enough to run the faster-whisper component at an acceptable speed.
I’ve run numerous tests in my language, Italian, and found a reasonable compromise using the medium-int8 model with a beam size of 3.
Below this configuration, too many words are misinterpreted, rendering the voice assistant completely useless.
The main issue with this configuration is the processing time, which is around 3 to 5 seconds.
I am very interested in deploying numerous satellites like the ESP32-S3-BOX, or in using the Stream Assist HACS component https://github.com/AlexxIT/StreamAssist. However, I would sincerely prefer to avoid a Home Assistant server with an i7 CPU.
I’ve been searching the web for a workable approach, without success. Apparently, a Coral TPU can’t handle this https://coral.ai/models/audio-classification, presumably because the model is too large.
So, my question essentially is: While Voice Assistant is fantastic, what are the hardware solutions that we could implement to handle voice assistant recognition in a robust and realistic scenario, considering many satellites that stream audio to the Home Assistant server?
Honestly, I have another consideration.
With faster whisper, do other languages seem to work well even with smaller models?
Could someone provide us with additional Italian models to test?
I’ve just set up an ESP32-S3-BOX as well, flashed with Willow. In HA I have the Willow Application Server add-on for speech recognition and the Amazon Polly integration for TTS, with responses played by the media player integration on Sonos speakers (no microphone needed) - the results are remarkably good, but cloud dependent of course.
Hi, thank you for sharing. I was looking into Willow and it seems to be a very impressive project. Unfortunately, building a Willow Inference Server requires substantial hardware, mostly Nvidia CUDA GPUs, as described on Willow’s website, and the Raspberry Pi 4 approach would perform worse than my current setup.
So Willow appears to rely on CUDA, and it could be worth finding low-cost NUC-class hardware with at least a GTX 1070, but that won’t be easy. Willow can’t run on my Jetson Nano because it requires too much memory.
If anyone has any ideas about suitable low-cost hardware for Willow, please let me know. It would be greatly appreciated.
For anyone reading this, I have tried Voice Assist locally a few times using whisper, but gave up as the performance was too slow. I run HA in proxmox on an HP T630. I have just tried out this Vosk addon and the response has gone from about 10 seconds to almost instant. The only issue so far is that Vosk is not quite so good at recognising the words, but a bit of work with Aliases should help that along. Give it a try. All this in about 15 minutes of playing about.
Hi Stiltjack, I will test the Willow cloud too. I had a bit of a problem with Google Assistant services in the past and I dream of the day when I could have an efficient system entirely local. But you are right, if I don’t find a proper NUC with a good price suitable for my purpose, the cloud could actually be the only solution.
Hi, I tried Vosk on HA installed in a VM on a Synology DS220+. It works much better than Whisper in Italian: it’s faster and doesn’t miss as many words. Is there also a better alternative to Piper as a TTS engine?
Read the post just above:
Hi, I tried Vosk on HA installed in a VM on a Synology DS220+. It works much better than Whisper in Italian: it’s faster and doesn’t miss as many words.