Like many people, I have a home server with a cheapo Intel Arc A380 for Jellyfin transcoding that otherwise does nothing, so I whipped up a docker compose to run GPU-accelerated speech-to-text using whisper.cpp.
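Roughly what that looks like, as a minimal sketch (the image name, environment variable, and port here are placeholders, not the exact values from my setup):

```yaml
services:
  whisper-cpp:
    # placeholder image; substitute whichever SYCL-enabled whisper.cpp build you use
    image: whisper-cpp-sycl:latest
    restart: unless-stopped
    environment:
      # hypothetical variable name; point it at the Whisper model you want loaded
      - WHISPER_MODEL=large-v2
    devices:
      # render node of the Arc A380
      - /dev/dri/renderD128:/dev/dri/renderD128
    group_add:
      # group owning the render node on my host
      - "107"
    ports:
      # placeholder; whatever port the server/wrapper listens on
      - "8910:8910"
```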
The initial request takes some time, but after that, on my A380, short requests in English like “Turn off kitchen lights” get processed in ~1 second using the large-v2 Whisper model.
speech-to-phrase can be better if you are using only the default conversation agent, but this could be useful when paired with LLMs, especially local ones in “Prefer handling commands locally” mode.
I imagine something like a B580 should be able to run this and a model like llama3.1 or qwen2.5 at the same time (using the ipex image).
Scratch that, I hadn’t used whisper.cpp in a while (due to using speech-to-phrase and its Rhasspy predecessor), but in the months since, the accuracy with the same large-v2 model has improved drastically (based on my recent usage). It even outperforms speech-to-phrase at telling “turn off” and “turn on” apart, and in noisy situations.
Hi, will this work with an Intel iGPU?
I haven’t tested it due to lack of an Intel iGPU, but whisper.cpp lists iGPUs as supported, so I don’t see why it wouldn’t.
One pitfall I’ve just noticed, though, is that you might want to map the whole /dev/dri in case the system has multiple GPUs, and also check the group the device belongs to, because it might not be 107 like on my end (I’ll add these as notes to the repo soon).
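Something like this compose fragment instead (107 is just the GID on my machine, check what your host actually reports):

```yaml
    devices:
      # map the whole DRI directory so the right card is visible even with multiple GPUs
      - /dev/dri:/dev/dri
    group_add:
      # run `stat -c '%g' /dev/dri/renderD128` on the host and use the GID it prints
      - "107"
```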
And have you tried large-v3 or large-v3-turbo? Curious how those compare accuracy-wise and speed-wise to large-v2.
Speed-wise they are the same, but in older versions of whisper.cpp I remember large-v3 hallucinating a lot more than large-v2. However, since whisper.cpp is now a lot more accurate in general, more testing is needed.
large-v3-turbo is substantially faster, by 40-50%; on the A380, simple requests take around 0.7 seconds. Even if it’s less accurate, it might be possible to work around that with the initial prompt option. I’ve exposed it in the script and will report how it works with large-v3-turbo.
Initial prompt option: Talk about that. What do you pass in to increase accuracy? There doesn’t seem to be much detail out there on optimizing that bit.
There is an example on GitHub of what you can pass. Whisper is in the same transformer family as LLMs, so basically just shove in a bunch of words you commonly use when talking with Assist, and maybe examples of what you expect the output to be; you can experiment with what works best.
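For example, something along these lines (a made-up prompt just to illustrate, not one I’ve tuned):

```text
Home Assistant voice commands: turn on the kitchen lights, turn off the living room lamp, set the thermostat to 21 degrees, start the vacuum, lock the front door.
```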
I tried this on a gen7 igpu and got:
```
whisper-cpp | terminate called after throwing an instance of 'std::runtime_error'
whisper-cpp |   what():  can not find preferred GPU platform
```
I tried setting LIBVA_DRIVER_NAME=i965 in the container, but it didn’t help.
According to the oneAPI system requirements for iGPUs (https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-toolkit-system-requirements.html), only 11th gen and newer are supported.
Bummer. In that case, I hope it gets wrapped up as an add-on so I can run it alongside my HA.
I have plans for that. I also want to try creating a similar container using the Vulkan backend instead of SYCL, which should work on even older iGPUs.
I’ve read that faster-whisper, which is the standard integration, is CPU-optimized. If true, perhaps a CPU like the Intel 355 would be a better choice than pursuing a GPU.
My older i3 NUC is too slow with Whisper. Can thread number be set for Whisper in an HAOS installation?