What's the best way to improve Whisper's Speech-to-Text accuracy?

I’ve had very good luck with wake words, even in noisy environments: close to 100% accuracy with no false positives. But I’ve been finding that voice commands are only understood by Whisper ~50% of the time in a quiet environment, and if the HVAC fan is on or a dishwasher is running then accuracy drops close to zero.

On top of the accuracy issue, if there is any background noise then the mic will stay on for 15-30 seconds or more, presumably interpreting the background noise as continued input.

There are two adjustable parameters for Whisper’s STT, but neither of them seems to have any noticeable impact on accuracy, and there’s also no obvious way to adjust the threshold for what level of noise to ignore.

Model: I’ve tried running tiny, base, and small, as well as the compressed int8 versions of each. If anything, the larger models seemed less accurate, and I even ran into a couple of instances where it “heard” words that don’t exist. For example, I would say “kitchen lights” and it would hear “gitchen lights”??

Beam size: I assumed this would give the model more candidate text to fit the audio to, so I increased it to 2 to see if it made a difference, but I haven’t noticed any increase in accuracy. Will further increases make any difference?

Since I didn’t see much difference when changing these, what do these two parameters actually adjust? What settings are others using that are working best? Are there other ways to adjust things in the software?
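My current understanding, which may well be wrong (hence the question), is that the add-on is a thin wrapper around the faster-whisper library: “model” picks which Whisper checkpoint gets loaded (tiny/base/small/medium, with the -int8 variants being quantized copies that use less RAM and CPU), and “beam size” is the width of the beam search used while decoding, i.e. how many candidate transcriptions are kept alive at each step, where 1 is greedy decoding and larger values cost more CPU. A rough sketch of those same two knobs when calling the library directly (the file name and values are placeholders I made up):

```python
from faster_whisper import WhisperModel

# "model" in the add-on should correspond to which checkpoint is loaded here;
# compute_type="int8" should correspond to the -int8 model variants.
model = WhisperModel("base", device="cpu", compute_type="int8")

# "beam size" is the beam search width used while decoding.
for beam in (1, 2, 5):
    segments, _info = model.transcribe("kitchen_lights.wav", beam_size=beam, language="en")
    text = " ".join(s.text.strip() for s in segments)
    print(f"beam_size={beam}: {text}")
```

If someone who knows the internals can confirm or correct that, I’d appreciate it.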

Replying to my own post to say that increasing the beam size actually made a huge difference. I couldn’t find anything in the documentation on what an appropriate value would be, but a couple of other threads mentioned that 5 was the max value, so I increased it to 5. I’m using base-int8 as the model.

Accuracy in quiet environments increased dramatically, so far it hasn’t misheard a command. Hopefully no more hallucinating nonsense words.

Accuracy in noisy environments still has some issues, but it’s noticeably improved.

I haven’t noticed any drop in performance, but I’m running it on an OptiPlex Micro with an i7-6700 and 16GB of RAM, so YMMV.
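The remaining trouble in noisy environments (and the mic staying open on background noise) might be a voice activity detection issue rather than a model issue. As far as I can tell, the faster-whisper library underneath has a built-in Silero VAD filter with a tunable threshold, but I don’t see it exposed in the add-on options, so take this as a sketch of the knob I was looking for rather than something you can set today (file name and values are placeholders):

```python
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

# vad_filter drops chunks that Silero VAD classifies as non-speech, which is
# roughly the "ignore background noise below this level" control I wanted.
segments, _info = model.transcribe(
    "command.wav",                      # placeholder path to a recorded command
    beam_size=5,
    vad_filter=True,
    vad_parameters={
        "threshold": 0.6,               # higher = stricter about what counts as speech
        "min_silence_duration_ms": 500, # give up sooner once the command ends
    },
)
print(" ".join(s.text.strip() for s in segments))
```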

Based on six months with Willow and our users in the real world, we believe that the small model with beam size 2 should be considered the minimum for a usable voice assistant in the real world, part of that being noisy environments.

You may want to try the small model with beam size 2 and see what your results look like.

After some experimenting, I settled on the medium-int8 model with beam size 2. That seems to give me the best trade-off between speed and accuracy. Smaller models were very inaccurate, but the medium model was way too slow at the default beam size of 5. Changing that to 2 gives decent performance and still seems accurate in the limited tests I did.

For reference: I use the Dutch language (nl) and am running the model in a Docker container on my NAS, which has a 4-core Intel Celeron J4125 running at 2GHz.
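If anyone wants to run the same kind of trade-off test on their own hardware, the comparison is easy to script against faster-whisper directly. This is only a sketch under my assumptions (that the add-on’s -int8 models map to compute_type="int8"; the clip path is a placeholder). It times each model/beam combination on one recorded command so you can see where your CPU falls over:

```python
import time

from faster_whisper import WhisperModel

CLIP = "dutch_command.wav"  # placeholder: one recorded voice command

for model_name in ("small", "medium"):
    model = WhisperModel(model_name, device="cpu", compute_type="int8")
    for beam in (2, 5):
        start = time.perf_counter()
        segments, _info = model.transcribe(CLIP, beam_size=beam, language="nl")
        text = " ".join(s.text.strip() for s in segments)  # consuming the generator runs the decode
        elapsed = time.perf_counter() - start
        print(f"{model_name}-int8 beam={beam}: {elapsed:.1f}s -> {text}")
```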


Did you change something in the openWakeWord config? Or even the Piper config?
I’m running on an AMD 5700G and it’s too slow and not even precise. Using Portuguese here.
My configs:



I will use this as a starting point for my tests on a Celeron N100. Speaking French here… neighbor :wink:

Just so I know, is there a module that allows us to improve the models? This could be done by manually listening to the recordings and correcting the model by typing what was actually said.

I think this would be helpful to allow the HA community to improve the models for everyone, and would enlist quite a task force for the job.
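I haven’t found anything like that yet, but purely as a hypothetical sketch of the data-collection half (the folder and file names are made up, and nothing here is HA-specific): loop over saved recordings, show what Whisper heard, let you type what was actually said, and write the pairs to a JSONL file that a later fine-tuning job could consume.

```python
import json
from pathlib import Path

from faster_whisper import WhisperModel

RECORDINGS_DIR = Path("recordings")  # hypothetical folder of saved command audio
OUTPUT = Path("corrections.jsonl")   # (audio, corrected text) pairs for later fine-tuning

model = WhisperModel("base", device="cpu", compute_type="int8")

with OUTPUT.open("a", encoding="utf-8") as out:
    for wav in sorted(RECORDINGS_DIR.glob("*.wav")):
        segments, _info = model.transcribe(str(wav), beam_size=5)
        guess = " ".join(s.text.strip() for s in segments)
        print(f"\n{wav.name}\n  Whisper heard: {guess!r}")
        corrected = input("  What was actually said (Enter to accept the guess): ").strip()
        out.write(json.dumps({"audio": str(wav), "text": corrected or guess},
                             ensure_ascii=False) + "\n")
```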