Voice assistant: faster-whisper is slow and the Italian model is weak. Any help?

vcarloni · February 13, 2024, 3:11pm

Hello everyone,
I am thrilled to learn about the ESP32-S3-BOX. I am currently conducting some experiments with an ESP32 I2S microphone and amplifier. I’ve encountered a few issues, but after numerous tests, I’ve determined that the bottleneck lies in the Whisper/faster-Whisper component.
I’ve read several other posts, but they have left me somewhat confused.
At present, I am using a NUC equipped with an Intel N5105 and sufficient memory to run any container. However, it’s not fast enough to run the faster-whisper component at a usable speed.

I’ve conducted numerous tests in Italian and found a reasonable compromise with the medium-int8 model and a beam size of 3.
Under this configuration, many words are misinterpreted, rendering the voice assistant completely useless.

The main issue with this configuration is the processing time, which is around 3 to 5 seconds.
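
For anyone who wants to reproduce the timing outside Home Assistant, here is a minimal sketch, assuming the `faster-whisper` Python package is installed. The helper names `real_time_factor` and `transcribe_timed` and the file name `command.wav` are illustrative only; model loading is deliberately left out of the measurement.

```python
import time


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio length; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return processing_s / audio_s


def transcribe_timed(wav_path: str, model_size: str = "medium", beam_size: int = 3):
    """Transcribe one Italian clip on the CPU with int8 weights and time it."""
    # Imported lazily so the helper above works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")

    start = time.perf_counter()
    segments, info = model.transcribe(wav_path, language="it", beam_size=beam_size)
    # `segments` is a generator: decoding only happens while iterating over it.
    text = " ".join(segment.text.strip() for segment in segments)
    elapsed = time.perf_counter() - start

    return text, elapsed, real_time_factor(elapsed, info.duration)


if __name__ == "__main__":
    # "command.wav" is a placeholder for a short recorded voice command.
    text, elapsed, rtf = transcribe_timed("command.wav")
    print(f"{elapsed:.2f} s (RTF {rtf:.2f}): {text}")
```

Running it with different `model_size` and `beam_size` values should make the trade-off between speed and accuracy easier to compare on the same recording.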

I am very interested in implementing numerous satellites like the ESP32-S3-BOX or using the Stream Assist HACS component https://github.com/AlexxIT/StreamAssist. However, I would sincerely prefer to avoid a Home Assistant server with an i7 CPU.

I’ve been searching for some sort of approach on the web without any success. Apparently, Coral TPU can’t handle this https://coral.ai/models/audio-classification, I believe due to the size of the model.

So, my question essentially is: While Voice Assistant is fantastic, what are the hardware solutions that we could implement to handle voice assistant recognition in a robust and realistic scenario, considering many satellites that stream audio to the Home Assistant server?

Honestly, I have another consideration.
With faster whisper, do other languages seem to work well even with smaller models?
Could someone provide us with additional Italian models to test?

Thanks in advance

Vittorio

Stiltjack · February 13, 2024, 4:38pm

I’ve just set up an ESP32-S3-BOX as well, flashed with Willow. In HA I have the Willow Application Server add-on for speech recognition and the Amazon Polly integration for TTS, with responses played by the media player integration on Sonos speakers (no microphone needed) - the results are remarkably good, but cloud dependent of course.

vcarloni · February 14, 2024, 11:43am

Hi, thank you for sharing. I was looking into Willow and it seems to be a very impressive project. Unfortunately, building a Willow Inference Server requires substantial hardware, mostly based on NVIDIA CUDA, as described on Willow’s website. The Raspberry Pi 4 approach is below my current performance capabilities.

So, Willow appears to work with CUDA and it could be beneficial to find a low-cost NUC hardware with at least a GTX 1070, but that won’t be easy. Willow can’t run on my Jetson Nano because the memory required is too high.

If anyone has any ideas about suitable low-cost NUC hardware for Willow, please let me know. It would be greatly appreciated.

will35 · February 14, 2024, 11:59am

Hello @vcarloni

You can try the Vosk add-on from synesthesiam. It is very accurate and faster than faster-whisper (less than 2 s on an RPi 4, 1 s on an i3 NUC) with French and Spanish (it also supports Italian):
hassio-addons/vosk at master · rhasspy/hassio-addons (github.com)

Stiltjack · February 14, 2024, 12:30pm

True, but you don’t have to build your own. The project provides free access to their WIS. I’ve found response times to be comparable to Alexa.

Arh · February 14, 2024, 2:40pm

For anyone reading this, I have tried Voice Assist locally a few times using Whisper, but gave up as the performance was too slow. I run HA in Proxmox on an HP T630. I have just tried out this Vosk add-on and the response time has gone from about 10 seconds to almost instant. The only issue so far is that Vosk is not quite so good at recognising the words, but a bit of work with aliases should help that along. Give it a try. All this in about 15 minutes of playing about.

will35 · February 14, 2024, 2:54pm

Hi @Arh
What language are you speaking?
As explained in my post just above, Vosk is very accurate for French and Spanish (I haven’t tested others).

Arh · February 14, 2024, 3:00pm

English. It’s proving pretty good so far though. I would say a good 90% success rate.

Sorry I meant to say that in my post.

vcarloni · February 15, 2024, 8:22am

This is a very good alternative. Let me try it; I will report back as soon as possible with speed and accuracy results for Italian. Thanks, will35.

vcarloni · February 15, 2024, 8:42am

Hi Stiltjack, I will test the Willow cloud too. I had a bit of a problem with Google Assistant services in the past and I dream of the day when I could have an efficient system entirely local. But you are right, if I don’t find a proper NUC with a good price suitable for my purpose, the cloud could actually be the only solution.

Thanks for your advice

gt4020 · March 10, 2024, 9:07am

Hi, I tried Vosk on HA installed in a VM on a Synology DS220+. It works much better than Whisper in Italian: it is faster and doesn’t get so many words wrong. Is there also a better alternative to Piper as a TTS engine?

will35 · March 10, 2024, 3:23pm

Perhaps, but I have no need for an alternative offline TTS engine. Piper is good enough for me.

CrazYoshi · August 28, 2024, 2:56pm

Hi,
Did you test Vosk? Does it work well in Italian?

will35 · August 28, 2024, 3:25pm

I’m French

Read the post just above :
Hi, I tried Vosk on HA installed in a VM on a Synology DS220+. It works much better than Whisper in Italian: it is faster and doesn’t get so many words wrong.

FelixGaebler · November 6, 2024, 10:38pm

I just installed Vosk and went from 19.2 s to 0.02 s (no joke). Insane for German so far.

razorbac · January 4, 2025, 11:30pm

I installed Vosk with the Italian model, and it’s waaaay faster than Whisper, but I can’t understand why it adds the letter “e” before almost every sentence, making it fail. Is anyone facing the same problem?

Jokerigno · January 19, 2025, 1:02pm

Same issue here, and it is basically always unusable. Has anyone found a solution?

razorbac · January 20, 2025, 9:59am

Not yet, neither here nor on the Reddit sub =( I have now moved to the free trial month of Nabu, and I have to admit it works quite well (as long as nobody else is around and there is no TV on with people speaking Italian).

albertoarmida · January 24, 2025, 8:08pm

Same problem here.
To temporarily work around it (while waiting for it to be permanently resolved), in the “custom_sentences/it/” folder I created a “_common.yaml” file in which I added the words “a” and “e” as skip_words.
This way, even if they continue to be prefixed in the recognition, they are ignored (after all, they are conjunctions) and now it works much better.
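
For anyone who wants to try the same workaround, here is a rough sketch of it in Python. The YAML layout (a top-level `language` key next to `skip_words`) is an assumption based on the usual Home Assistant sentence-file format, so check it against the current documentation. `write_common_yaml` and `strip_skip_words` are illustrative helpers, and the stripping function is only a simplified picture of the effect, not the matcher Home Assistant actually uses.

```python
from pathlib import Path

# Assumed file contents: top-level "language" and "skip_words" keys.
COMMON_YAML = '''language: "it"
skip_words:
  - "a"
  - "e"
'''


def write_common_yaml(config_dir: str) -> Path:
    """Create custom_sentences/it/_common.yaml under the given config directory."""
    target = Path(config_dir) / "custom_sentences" / "it" / "_common.yaml"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(COMMON_YAML, encoding="utf-8")
    return target


def strip_skip_words(text: str, skip_words=("a", "e")) -> str:
    """Simplified illustration: drop skip words before the sentence is matched."""
    return " ".join(w for w in text.split() if w.lower() not in skip_words)


if __name__ == "__main__":
    # "/config" is the usual Home Assistant config path; adjust as needed.
    print(write_common_yaml("/config"))
    print(strip_skip_words("e accendi la luce"))
```

After creating the file, Home Assistant typically needs a restart (or a reload of the conversation integration) before the new sentences file is picked up.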

maglat · January 26, 2025, 4:34pm

As soon as you have decent enough hardware to run Whisper’s large model, it works very well. I am using Whisper with the large model and German on a Mac Mini M4. It is fast and accurate. All models below large gave a very disappointing experience.