Whisper very slow on first access

Is the slow speed shown here, to turn a three-word sentence into text, normal for faster-whisper?

I’m running on an X86 host that doesn’t seem to be stretched and responds much faster on subsequent requests. Assist is being called from a dashboard on my phone, so obviously the host has just rendered the web page (and quickly). I had hoped the system would be faster than Google but on each first use, it’s much slower, to the point where I keep thinking the operation will fail - almost as if the system has gone to sleep. As you see, the total time here to “Done” is about 23 seconds. I also often get “No text recognised” on first access, although I’m speaking in the same way and at the same distance from the microphone as on repetitions where the system does respond correctly.

Hello, have you already found a solution?

I also have an x86 machine running HA, and voice recognition doesn’t work well for me either. I set Whisper to small-int8. Every other time it doesn’t understand me, and when it does understand me, it takes about 20 seconds until an action is triggered. (With tiny-int8 it doesn’t understand the words; with the better models the system doesn’t respond.)

(tested with Android HA on a OnePlus 9 Pro)

Greetings
Dirk

Sorry Dirk. Disappointingly, I still haven’t found an answer to this issue or had any response other than yours.

What is your hardware?

Speech recognition and Whisper (even faster-whisper) rely on a large number of highly parallel operations. Unfortunately, for the voice assistant use case, the timing and performance you are experiencing are about right for CPU.

As an example, three seconds of speech on a Threadripper PRO 5955WX (bare metal, otherwise idle) using base-int8 with faster-whisper takes 245ms. That may seem fast but needless to say this hardware is ridiculous and not something most people have (the CPU alone sells for $1K - used).

When I test with a less ridiculous but still very capable CPU (11th Gen Core i7), base still takes several seconds. Base also isn’t terribly accurate; I find that small or medium with beam size 2 is a minimum for most voice assistant tasks and commands. Tiny and base can work, but you need very high quality audio and have to speak very, very clearly and intently - which almost never happens in casual voice assistant use. Even then the error rates are still very high. Of course the larger models take significantly longer to run: my Threadripper above takes 641ms for small and 1614ms for medium.
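A minimal sketch for timing this on your own hardware, assuming the faster-whisper Python package is installed and you have a short recording to hand. The helper names `time_call` and `make_transcribe_job` are made up for illustration, and this is not necessarily how the numbers above were measured.

```python
import time
from statistics import median


def time_call(fn, repeats=5, warmup=1):
    """Return the median wall-clock time of fn() in milliseconds.

    Warm-up runs are discarded so one-time setup cost does not skew the result.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return median(samples)


def make_transcribe_job(audio_path, model_size="base", device="cpu",
                        compute_type="int8", beam_size=1):
    """Build a zero-argument callable that transcribes audio_path once.

    Hypothetical helper: faster-whisper is imported lazily so the timing
    helper above stays usable without it.
    """
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device=device, compute_type=compute_type)

    def job():
        segments, _info = model.transcribe(audio_path, beam_size=beam_size)
        # segments is a generator; decoding only happens when it is consumed.
        return " ".join(segment.text for segment in segments)

    return job
```

For example, `time_call(make_transcribe_job("command.wav", model_size="small", beam_size=2))` would time the small model with beam size 2, where `command.wav` is a placeholder for your own recording.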

Timely speech recognition for voice assistant tasks really needs to run on GPU. As a comparison to the scenario above, a seven-year-old Nvidia GTX 1070 ($100 used) does three seconds of speech using base-int8 in 70ms and 588ms with medium. If you’re looking for high quality and reliable speech recognition and don’t want to wait 5-10 seconds (or more), CPU just won’t do it. Even a lowly GTX 1070 has almost 2,000 cores and more than 250 GB/s of memory bandwidth, which alone is 5x faster than the fastest DDR5 memory available. Almost everything in the ML/AI space also primarily targets Nvidia CUDA, which is where the vast majority of the software optimization work has been done.
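To put the quoted timings side by side, here is a small sketch that computes the GPU speedup and the real-time factor (processing time divided by audio length). The numbers are the ones quoted in this thread; the function names are just illustrative.

```python
# Timings quoted in this thread, in milliseconds, for three seconds of speech.
AUDIO_SECONDS = 3.0
CPU_MS = {"base": 245, "small": 641, "medium": 1614}  # Threadripper PRO 5955WX
GPU_MS = {"base": 70, "medium": 588}                  # GTX 1070


def real_time_factor(elapsed_ms, audio_seconds=AUDIO_SECONDS):
    """Processing time divided by audio duration; lower is faster."""
    return (elapsed_ms / 1000.0) / audio_seconds


def speedup(cpu_ms, gpu_ms):
    """How many times faster the GPU run was than the CPU run."""
    return cpu_ms / gpu_ms


for model in sorted(GPU_MS):
    print(
        f"{model}: GPU is {speedup(CPU_MS[model], GPU_MS[model]):.2f}x faster, "
        f"CPU RTF {real_time_factor(CPU_MS[model]):.3f}, "
        f"GPU RTF {real_time_factor(GPU_MS[model]):.3f}"
    )
```

Keep in mind the CPU side of this comparison is a workstation part; on a typical home server CPU the gap would be much wider.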

Note that these are all “floor” benchmarks with audio provided directly to the models - this doesn’t include any network latency, execution time for commands, TTS, or overhead due to VAD/Wake Word/etc. So you can start to see how things can really start to add up to the point where many people will see voice command times in the tens of seconds.
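As a rough illustration of how the stages add up, the sketch below sums a per-stage latency budget. Only the speech-to-text figure (1614ms, medium on the Threadripper) comes from this thread; every other value is a made-up placeholder, not a measurement, and the stage names are hypothetical.

```python
def total_latency_ms(stages):
    """Sum the per-stage latencies (milliseconds) of one voice command."""
    return sum(stages.values())


def share_of_total(stages, name):
    """Fraction of the end-to-end time spent in the named stage."""
    return stages[name] / total_latency_ms(stages)


# Placeholder budget: replace each value with your own measurements.
example_stages = {
    "wake_word_and_vad": 500,   # placeholder, not measured
    "network": 100,             # placeholder, not measured
    "speech_to_text": 1614,     # medium model on the Threadripper, from this thread
    "command_execution": 200,   # placeholder, not measured
    "tts": 300,                 # placeholder, not measured
}

print(f"total: {total_latency_ms(example_stages)} ms")
print(f"speech-to-text share: {share_of_total(example_stages, 'speech_to_text'):.0%}")
```

On slower CPUs the speech-to-text entry grows from a second or two to many seconds, which is how totals in the tens of seconds come about.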

Cloud providers (even Nabu Casa via Azure) all use GPU/TPU because it’s the only way to do high quality and high performance speech recognition.

My hardware is not the most powerful - it’s a Geekom MiniAir 11 with an Intel Celeron N5095 (described as “4 Cores, 4 Threads, 4M Cache, up to 2.90 GHz” - I’m guessing that’s an overclocked speed, not the one it’s running at under HassOS). Before I installed HassOS it was able to run Windows 11 at an acceptable speed, so it’s definitely a step up from the Raspberry Pi. I hadn’t realised that speech recognition was quite so demanding on hardware.

If that’s coming from Intel ARK it’s probably a real-world number without overclocking but the thermals make the situation tough. While it can run HA very well and even Windows, it’s extremely underpowered for speech recognition. faster-whisper with a speech recognition model that even gets close to “working” will completely peak those cores for long enough to run into thermal throttling and down-clock from the 2.90 GHz peak/boost speed, resulting in the many, many second response times people are seeing. Even at boost it’s nowhere near powerful enough.

I’ve been doing this for years so I know this all too well. However, as HA makes progress on voice it’s becoming clearer and clearer most HA users (very understandably) don’t know this, and many of them seem to be as surprised as you are - which again is completely understandable. High quality speech recognition is a completely different animal compared to everything and anything else HA is doing.

I try to participate and help out where I can in the HA community without shilling my own project, but when it comes up I want to be very transparent. I’m the founder of Willow.

In the six months since our release we’ve had a lot of skepticism and pushback on our GPU emphasis for completely local voice. I’ve spent a lot of time trying to explain the fundamentals of these things to understandably skeptical users. From what I’m seeing in community feedback on HA’s CPU-emphasized approach, many users seem to be very underwhelmed and disappointed by the experience with faster-whisper on their hardware. This is in no way the fault of HA; it’s just the reality and fundamentals of speech recognition.

I like to say that speech recognition is fundamentally a GPU thing, and when you bring a CPU to a GPU fight for these kinds of applications you’re going to lose - bad. We’ve seen a trend with Willow users - they start with CPU, observe the very poor experience, go out and buy a cheap GPU, and become very happy. I suspect the same thing will happen with HA and there are already a lot of people getting the HA faster-whisper approach running on GPU.

We have a very highly optimized Whisper implementation of our own and you can see the numbers yourself on our benchmarks page. That said, CPU “is what it is” and even our CPU numbers are essentially impractical.

Any idea if Coral Accelerators would help, or is anyone doing that already?
They have Mini PCIe and M.2 versions that a lot of people are using for facial recognition.

I just picked on you because you seem to be very knowledgeable and into it.

So, speech on HA is nice, but you will have to buy new hardware, which will use much more power…
So Google Assistant is still the option that works well for voice, but sadly it’s not local…
I think it’s a missed opportunity for HA. Maybe they should warn users about the high specs needed to use voice…

Btw, thanks for clearing things up. I had been searching for a solution to the very slow speech conversion.