Like many people, I have a home server with a cheapo Intel Arc A380 for Jellyfin transcoding that otherwise does nothing, so I whipped up a docker compose to run GPU-accelerated speech-to-text using whisper.cpp.
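Roughly what that looks like, as a minimal sketch (the image name, environment variable, and port here are placeholders, not the exact values from my setup):

```yaml
services:
  whisper-cpp:
    # placeholder image; substitute whichever SYCL-enabled whisper.cpp build you use
    image: whisper-cpp-sycl:latest
    restart: unless-stopped
    environment:
      # hypothetical variable name; point it at the Whisper model you want loaded
      - WHISPER_MODEL=large-v2
    devices:
      # render node of the Arc A380
      - /dev/dri/renderD128:/dev/dri/renderD128
    group_add:
      # group owning the render node on my host
      - "107"
    ports:
      # placeholder; whatever port the server/wrapper listens on
      - "8910:8910"
```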
The initial request takes some time, but after that, on my A380, short requests in English like “Turn off kitchen lights” get processed in ~1 second using the large-v2 Whisper model.
speech-to-phrase can be better if you are using only the default conversation agent, but this could be useful when paired with LLMs, especially local ones in “Prefer handling commands locally” mode.
I imagine something like a B580 should be able to run this and a model like llama3.1 or qwen2.5 at the same time (using the ipex image).
Scratch that, I hadn’t used whisper.cpp in a while (due to using speech-to-phrase and its Rhasspy predecessor), but in the months since, the accuracy with the same large-v2 model has improved drastically (based on my recent usage). It even outperforms speech-to-phrase at telling “turn off” and “turn on” apart, and in noisy situations.
Hi, will this work with an Intel iGPU?
I haven’t tested it due to lack of an Intel iGPU, but whisper.cpp lists iGPUs as supported, so I don’t see why it wouldn’t.
One pitfall I’ve just noticed, though, is that you might want to map the whole /dev/dri in case the system has multiple GPUs, and also check the group the device belongs to, because it might not be 107 like on my end (I’ll add these as notes to the repo soon).
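Something like this compose fragment instead (107 is just the GID on my machine, check what your host actually reports):

```yaml
    devices:
      # map the whole DRI directory so the right card is visible even with multiple GPUs
      - /dev/dri:/dev/dri
    group_add:
      # run `stat -c '%g' /dev/dri/renderD128` on the host and use the GID it prints
      - "107"
```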
And have you tried large-v3 or large-v3-turbo? Curious how those compare accuracy-wise and speed-wise to large-v2.
Speed-wise they are the same, but in older versions of whisper.cpp I remember large-v3 hallucinating a lot more than large-v2. However, since whisper.cpp is now a lot more accurate in general, more testing is needed.
large-v3-turbo is substantially faster, by 40-50%; on the A380, simple requests take around 0.7 seconds. Even if it’s less accurate, it might be possible to work around that with the initial prompt option. I’ve exposed it in the script and will report how it works with large-v3-turbo.
Initial prompt option: Talk about that. What do you pass in to increase accuracy? There doesn’t seem to be much detail out there on optimizing that bit.
There is an example on GitHub of what you can pass. Whisper is in the same transformer family as LLMs, so basically just shove in a bunch of words you commonly use when talking with Assist, and maybe examples of what you expect the output to be; you can experiment with what works best.
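For example, something along these lines (a made-up prompt just to illustrate, not one I’ve tuned):

```text
Home Assistant voice commands: turn on the kitchen lights, turn off the living room lamp, set the thermostat to 21 degrees, start the vacuum, lock the front door.
```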
I tried this on a gen7 igpu and got:
```
whisper-cpp | terminate called after throwing an instance of 'std::runtime_error'
whisper-cpp |   what():  can not find preferred GPU platform
```
I tried setting LIBVA_DRIVER_NAME=i965 in the container, but it didn’t help.
According to the oneAPI system requirements for iGPUs (https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-toolkit-system-requirements.html), only 11th gen and newer are supported.
Bummer. In that case, I hope it gets wrapped up as an add-on so I can run it alongside my HA.
I have plans for that. I also want to try creating a similar container using the Vulkan backend instead of SYCL, which should work on even older iGPUs.
I’ve read that faster-whisper, which is the standard integration, is CPU-optimized. If true, perhaps a CPU like the Intel 355 would be a better choice than pursuing a GPU.
My older i3 NUC is too slow with Whisper. Can thread number be set for Whisper in an HAOS installation?