Improve Whisper performance on Intel hardware

I’ve been dipping my toes in the voice control waters to check if it’s possible to replace Alexa with Assist for home control. I have an Onju home board that I added to a Google Nest Mini, and I have it working.

However, the speech-to-text recognition is pretty bad in Spanish. Even speaking very clearly in an environment with no noise, accuracy ranges from absolutely useless with the tiny and base models to hit or miss with the small_int8 model. And I promise you I’m actually a clear speaker without any strong accent.

It’s clear that non-English speakers need to use models bigger than small, but even on my fairly decent home server (12th-gen i3 with 10 cores and 32 GB of RAM, running Proxmox with HA and a few other apps), small_int8 is really the biggest model one can use, as medium takes 6–7 seconds to respond to a command.
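For reference, this is roughly how I’m picking the model in the Whisper add-on (shown as YAML; the option names are what I see on the add-on’s configuration page, so double-check them against your version):

# Whisper add-on configuration (sketch, verify option names on your install)
model: small-int8     # tiny-int8 / base-int8 / small-int8 / medium-int8 ... bigger = slower but more accurate
language: es
beam_size: 1          # raising this can improve accuracy at the cost of extra latency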

Has anyone succeeded in running whisper, whisper.cpp, faster_whisper or any Whisper mod leveraging the integrated GPU in modern Intel hardware?

The integrated Xe GPU in 12th/13th-gen Intel processors and above belongs to the same Xe architecture family as the dedicated Intel Arc GPUs, and I’ve seen that Intel has published libraries to accelerate inference on their Arc cards, so I’m inclined to think it should be possible, but I don’t know enough about PyTorch and AI to even begin to investigate.
These Iris Xe iGPUs are moderately capable too, on par with the Radeon Vega 8 in AMD 4000-series APUs or older mid-tier discrete graphics cards like the GTX 860M.

Running Home Assistant on Intel NUCs or other repurposed hardware with Intel CPUs that have integrated graphics is fairly common, so if this were possible, a lot of people would benefit from it.

Even more so once we attempt to also generate responses using a small LLM.

UPDATE:
Just to be sure, I started saving my voice recordings using these lines in configuration.yaml:

assist_pipeline:
  # Store audio recordings for debugging/training purposes
  debug_recording_dir: /config/www/assist_pipeline/

I wanted to be sure the results weren’t bad because of audio quality issues, but that’s not the reason: the audio samples I get are pretty decent, with a clear voice and negligible background noise.


Hello

Try the Vosk add-on; it’s very fast and accurate for the supported non-English languages.

I did, and it’s indeed very fast. But I found accuracy to be… weird. Specifically, for smart-home-related sentences it’s very good, but nowhere near as good for other sentences like “How much time is left for the washing machine?”.

Also, I found that sometimes it’s too eager to respond to commands. So much so that it doesn’t wait until I’ve finished talking: when listening to a sentence like “Turn on the lights in the kitchen”, as soon as it hears “Turn on the lights” it stops listening and misses the “in the kitchen” part.

But I agree that speed-wise it’s amazing. It’s so fast that sometimes the lights turn on before I’ve even closed my mouth. Like 1/10th of a second.

If you manage to use iGPU acceleration in whisper.cpp, then pay attention to this project: GitHub - ser/wyoming-whisper-api-client: Wyoming protocol server for the Whisper API speech to text system
Perhaps there are some other recognition implementations with a suitable API that can be connected to the Wyoming protocol.

And with a larger model?

VOSK Models (alphacephei.com)

Actually, I have not. I’ll give it a go. The difference in size is tremendous (39 MB -> 1.4 GB).

I tried the bigger model and accuracy is indeed better. Nevertheless, I think that being able to run Whisper with GPU acceleration is a good thing. We don’t know how things are going to evolve, and maybe Whisper keeps improving while Vosk stagnates.

It’s good to try to leverage the hardware we already have.

I know this does not answer your question, but maybe it’s worth thinking about.

Or make use of good old used hardware that can be bought cheaply.
I’ve thrown in a GTX 1050 Ti using a “PCI-E 1X USB 3.0 riser card” since my server case is way too small (Amazon.de).

The response time is quite similar to Echo devices when using the German medium-int8 model.

I thought installing CUDA and everything would be a hard task, but it’s really not a big problem. I already had the whole Wyoming stack running on another machine (I run HA on an underpowered used thin client) using Docker Compose. Instructions on how to do that should be easy to find…
The only thing I had to do was install the GPU drivers and the NVIDIA Container Toolkit on the Docker host and change the Whisper image to a CUDA-capable one.

If you consider trying this, I recommend wyoming-addons/whisper/docker-compose.example.yml at 16d3cb41d0ed6be608118e7b1587194aabbf1967 · pierrewessman/wyoming-addons · GitHub
I adapted the meat of it into my existing setup, so it works differently from how it’s described in the repo’s readme; the relevant bit looks roughly like the sketch below.
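This is only a minimal sketch of the Whisper service: the image tag, the --device flag and the model name are assumptions on my side, so compare it with the linked example and the image’s documentation before copying.

services:
  whisper:
    image: rhasspy/wyoming-whisper   # swap for the CUDA-enabled image from the linked repo
    command: --model medium-int8 --language de --device cuda
    volumes:
      - ./whisper-data:/data
    ports:
      - "10300:10300"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

The deploy.resources.reservations block is what hands the GPU to the container once the NVIDIA Container Toolkit is installed on the host.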

I’m running it on Ubuntu 20.04 Server and installed the nvidia-driver-535-server package. You would also need Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit 1.14.5 documentation, and then you should be ready to go.

Another benefit is that I’m now able to play around with other LLMs using Ollama and Open WebUI on the same GPU, which is also surprisingly easy to install with Docker. My own private AI.

Just in case you’re interested in that also: GitHub - open-webui/open-webui: User-friendly WebUI for LLMs (Formerly Ollama WebUI)
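If it helps, this is roughly the compose file I use for that part (again just a sketch; image names, ports and the OLLAMA_BASE_URL variable follow the projects’ READMEs, so verify against the current docs):

services:
  ollama:
    image: ollama/ollama
    volumes:
      - ./ollama:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434   # point the UI at the Ollama container
    ports:
      - "3000:8080"
    depends_on:
      - ollama
    restart: unless-stopped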

Sure, but I can’t put a GPU in an Intel NUC, and it’s quite a popular device, and relatively capable too, so I was wondering if someone had managed to enable GPU acceleration on it.

I’d rather not have to buy another computer.

NUC… I overlooked that. Sorry

There are also adapters available for M.2 slots :thinking:

Hi there!
Same situation here. I wonder if you got any further with this?
Thanks @cibernox

I did not. It seems that, generally speaking, AIs are getting better over time at running on CPUs, and that’s something, but nothing I’ve found suggests much effort is going into optimizing them for integrated GPUs.