What's the best way to improve Whisper's Speech-to-Text accuracy?

I’ve had very good luck with wake words, even in noisy environments: close to 100% accuracy with no false positives. But voice commands are only understood by Whisper about 50% of the time in a quiet environment, and if the HVAC fan is on or a dishwasher is running, accuracy drops close to zero.

On top of the accuracy issue, if there is any background noise the mic will stay on for 15-30 seconds or more, presumably interpreting the background noise as continued input.

There are two adjustable parameters for Whisper STT, but neither of them seems to have any noticeable impact on accuracy, and there’s also no obvious way to adjust the threshold for what level of noise to ignore.

Model: I’ve tried running tiny, base, and small, as well as the compressed int8 versions of each. If anything the larger models seemed less accurate, and I even ran into a couple of instances where it “heard” words that don’t exist. I would say “kitchen lights” and it would hear “gitchen lights”??

Beam size: I assumed this would allow the model more choice of text to fit the audio to, so I increased it to 2 to see if it made a difference, but I haven’t noticed any increase in accuracy. Will further increases make any difference?

Since I didn’t see much difference when changing these, what do these two parameters actually adjust? What settings are working best for others? Are there other ways to adjust things in the software?
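
In case it helps anyone answer: from what I can tell (and I may be wrong), the add-on is a thin wrapper around the faster-whisper Python library, and these two settings get passed more or less straight through to its transcribe call. The sketch below is that library used directly, not the add-on’s actual code, and the values are only guesses; the vad_filter/vad_parameters part looks like the noise-threshold knob I couldn’t find anywhere in the UI.

```python
# Rough sketch of the underlying faster-whisper call (not the add-on source).
# Parameter names come from the faster-whisper docs; the values are my guesses.
from faster_whisper import WhisperModel

# "Model" setting: which weights to load; compute_type="int8" gives the -int8 variant
model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    "command.wav",            # hypothetical recording of a voice command
    language="en",
    beam_size=1,              # "Beam size" setting: how many decoding hypotheses are kept
    vad_filter=True,          # voice activity detection: drop non-speech audio
    vad_parameters=dict(
        threshold=0.5,                # speech probability needed to count as speech
        min_silence_duration_ms=500,  # how quickly it decides you stopped talking
    ),
)
print(" ".join(s.text.strip() for s in segments))  # segments is lazy; joining runs the decode
```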

Replying to my own post to say that increasing the beam size actually made a huge difference. I couldn’t find anything in the documentation about what an appropriate value would be, but a couple of other threads mentioned that 5 was the maximum, so I increased it to 5. I’m using base-int8 as the model.

Accuracy in quiet environments increased dramatically; so far it hasn’t misheard a command. Hopefully no more hallucinated nonsense words.

Accuracy in noisy environments still has some issues, but it’s noticeably improved.

I haven’t noticed any drop in performance, but I’m running it on an OptiPlex Micro with an i7-6700 and 16GB of RAM, so YMMV.
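
If anyone wants to sanity-check the speed cost of a higher beam size on their own hardware before changing the add-on setting, a quick loop like the one below against a clip recorded from the same mic should show it. This is only a sketch; test_command.wav and the beam values are placeholders of mine.

```python
# Quick timing sweep over beam sizes with faster-whisper; test_command.wav is a
# short recording of a typical voice command (placeholder name).
import time

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

for beam in (1, 2, 3, 5):
    start = time.perf_counter()
    segments, _ = model.transcribe("test_command.wav", language="en", beam_size=beam)
    text = " ".join(s.text.strip() for s in segments)  # forces the lazy decode to run
    print(f"beam_size={beam}: {time.perf_counter() - start:.2f}s -> {text!r}")
```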

Based on the past six months with Willow and our users in the real world, we believe the small model with beam size 2 should be considered the minimum for a usable voice assistant, partly because of noisy environments.

You may want to try small with beam size 2 and see what your results look like.

After some experimenting, I settled on the medium-int8 model with beam size 2. That seems to give me the best trade-off between speed and accuracy. Smaller models were very inaccurate, but the medium model was way too slow at the default beam size of 5. Changing that to 2 gives decent performance and still seems accurate in the limited tests I did.

For reference: I use the Dutch language (nl) and am running the model in a Docker container on my NAS, which has a 4-core Intel Celeron J4125 running at 2GHz.
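
For anyone who wants to reproduce this outside the add-on, the setup roughly boils down to the call below when using faster-whisper directly. It is only a sketch; cpu_threads=4 simply matches the J4125’s core count, which is an assumption on my part rather than something I benchmarked.

```python
# Minimal sketch of the medium-int8 / beam size 2 / Dutch setup with faster-whisper.
# cpu_threads=4 matches the J4125's four cores (assumption, not measured).
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cpu", compute_type="int8", cpu_threads=4)
segments, info = model.transcribe("commando.wav", language="nl", beam_size=2)

print(f"{info.duration:.1f}s of audio")
print(" ".join(s.text.strip() for s in segments))
```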

Did you change something in the openWakeWord config? Or even in Piper?
I’m running on an AMD 5700G and it’s too slow and not even accurate. Using Portuguese here.
My configs:



I will use this as a starting point for my tests on a Celeron N100. Speaking French here… neighbor :wink:

Just so I know, is there a module that would let us improve the models? This could be done by manually listening to the recordings and correcting the model by typing what was actually said.

I think this would be helpful, allowing the HA community to improve the models for everyone, and it could mobilize quite a task force for the job.

I know that I am replying to an old thread, but I agree with this even in 2025. Although HA lists the various model types along with a short description, as far as accuracy goes, they are not correct. I have gone through three OptiPlex Micro models just for HA voice. The i5 with 8GB of memory was more than enough before, but not even close with local voice control.

HA recommends using the largest model your PC can handle, as it is supposed to be more accurate, but this is not true. I eventually bought an OptiPlex 7020 Micro with an i7-14700 processor and 32GB of memory, which is very, very quick. It could handle the large models, but even then the response was very slow, 7-8 seconds on large-v3 even at 5.5GHz turbo. Worse, it was far WORSE at turning voice commands into text: it would make up completely oddball words that aren’t in the dictionary. That would be tolerable, but it also has no consistency in handling these. As this thread mentions, accuracy is only 50%, maybe even less.

Previously, I found that the int8 models gave the best performance, so I went back to the largest int8 model, which is medium-int8. Night and day difference in accuracy and performance: almost no delay and accurate transcriptions. Kudos to whoever created these models, and I hope a large int8 comes out one day.
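
On the “large int8” wish: if I understand faster-whisper/CTranslate2 correctly, the int8 quantization is applied when the model is loaded (via compute_type), so if you run the library or wyoming-faster-whisper yourself instead of the stock add-on, something like the sketch below should already give you the large weights in int8. I haven’t compared it against medium-int8, so treat it as an experiment rather than a recommendation.

```python
# Sketch only: load the large-v3 weights with on-the-fly int8 quantization.
# Whether this beats medium-int8 on accuracy or latency is untested here.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cpu", compute_type="int8")
segments, _ = model.transcribe("test_command.wav", beam_size=2)
print(" ".join(s.text.strip() for s in segments))
```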

Interesting to know there are inaccurate results even with the bigger models. Thanks for sharing your experience!