I’m doing some extensive tests on Whisper and, as expected, I’m running into serious performance and accuracy issues.
I run HA and the voice pipeline on an HP mini PC (i5, 16 GB of RAM, no dedicated GPU). The bare minimum for usable STT is the medium-int8 model with beam=5, but at those settings the performance is unacceptable. I’m trying to convince the family to toss out all the polluting Alexa devices from the house, but that means I need a (more or less) comparable level of performance.
I was wondering if there is some tuning I can do before throwing in the towel. Are Vosk models worth a try?
I’ve found other posts talking about Vosk, and I tried to build a temporary Docker container for it, but its performance was also not acceptable (maybe it could be better with some fine-tuning).
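
To put numbers on "unacceptable" before tuning further, one option is to time the same recording at different beam sizes. This is only a rough sketch: it assumes the faster-whisper Python package (the post does not say which Whisper implementation is behind the pipeline), and `benchmark`, `real_time_factor` and the WAV path are hypothetical names introduced for illustration.

```python
import time


def real_time_factor(elapsed_s: float, audio_s: float) -> float:
    """Processing time divided by audio length; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return elapsed_s / audio_s


def benchmark(wav_path, model_name="medium", beam_sizes=(1, 5), language=None):
    """Transcribe one file at several beam sizes and report speed plus text.

    Assumption: faster-whisper is installed. It is imported here so the
    helper above stays usable without the package.
    """
    from faster_whisper import WhisperModel

    # int8 quantization on CPU, matching the "medium-int8" setup in the post.
    model = WhisperModel(model_name, device="cpu", compute_type="int8")
    results = {}
    for beam in beam_sizes:
        start = time.perf_counter()
        # language=None lets the model auto-detect; pass e.g. "it" to skip detection.
        segments, info = model.transcribe(wav_path, beam_size=beam, language=language)
        # segments is a generator, so the actual decoding happens while joining.
        text = " ".join(seg.text.strip() for seg in segments)
        elapsed = time.perf_counter() - start
        results[beam] = (real_time_factor(elapsed, info.duration), text)
    return results


if __name__ == "__main__":
    # Hypothetical file name: use a short recorded voice command.
    for beam, (rtf, text) in benchmark("command.wav").items():
        print(f"beam={beam} rtf={rtf:.2f} text={text!r}")
```

Comparing beam=1 against beam=5 on a few real commands shows how much speed the smaller beam buys and whether the transcripts actually get worse for your voice and language.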
Take a look at Microsoft STT.
It is cloud based and free for low usage.
It is quite good at recognizing Italian.
For local STT, I did not find anything acceptable.