Run whisper on external server

Interesting to see that CUDA does not accelerate Piper significantly. I was about to begin the journey of Piper + CUDA 12 on Docker (WSL2), but you may have changed my mind.

Do you know if Piper keeps the model in VRAM?
Just trying to better understand why the performance difference is so small.

Depends on your CPU: more power = less time. I imagine by now they have all the plumbing figured out, so CUDA may be worth it now.

The easy way today: if you have a GPU, use the original whisper.cpp from ggerganov and its OpenAI-compatible API with the Home Assistant plugin.

I tested it and have adopted it now. No overhead, and it is very fast, really very fast.
Plugin and some instructions: GitHub - neowisard/ha_whisper.cpp_stt: Home Assistant Whisper.cpp API STT integration
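
For anyone taking this route, building whisper.cpp with CUDA and starting its HTTP server looks roughly like the sketch below (model choice and paths are examples; the build flag and binary names have changed across whisper.cpp versions, so check the repo README if these fail):

# clone and build whisper.cpp with CUDA support
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_CUDA=1
cmake --build build -j --config Release

# fetch a model and start the server
./models/download-ggml-model.sh medium
./build/bin/whisper-server -m models/ggml-medium.bin --host 0.0.0.0 --port 8080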

For those who are running AMD hardware, I put together a container that runs faster-whisper with ROCm support. I don’t think this was possible until very recently, so I may be among the earliest to implement it for my setup. I figured sharing it here may help some people get the best performance possible.

wyoming-faster-whisper-rocm
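
Under the hood it boils down to passing the ROCm devices through to the container, roughly like this (a sketch; the image name and flags here are illustrative, see the repo README for the exact invocation):

# sketch: ROCm needs the kernel driver and DRI render devices passed through
docker run -d --name wyoming-faster-whisper-rocm \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  -p 10300:10300 \
  -v /path/to/data:/data \
  wyoming-faster-whisper-rocm \
  --model medium-int8 --language en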


I hope the below helps someone, as it took me ages to find a fully working x86-64 version.
Make sure you have the NVIDIA Container Toolkit and drivers installed.

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
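
On a Debian/Ubuntu host, the steps from that guide boil down to roughly the following (check the linked docs for the current repository setup before copying):

# add NVIDIA's signing key and apt repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# install the toolkit and register it as a Docker runtime
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker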

This docker-compose.yaml worked for me with CUDA acceleration and Wyoming support.

services:
  faster-whisper:
    image: lscr.io/linuxserver/faster-whisper:gpu
    container_name: faster-whisper-cuda-linux
    runtime: nvidia
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Europe/London
      - WHISPER_MODEL=medium-int8
      - WHISPER_BEAM=1 #optional
      - WHISPER_LANG=en #optional
    volumes:
      - /root/.cache/whisper:/config # adjust this path to your setup
    ports:
      - 10300:10300
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2 # I have two CUDA-capable cards
              capabilities:
                - gpu
networks: {}
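
Once the container is up, it's worth a quick sanity check before adding it in Home Assistant via the Wyoming Protocol integration (host = your server IP, port = 10300):

# confirm the container sees the GPUs (run nvidia-smi on the host if it's not in the container)
docker exec faster-whisper-cuda-linux nvidia-smi

# confirm the Wyoming port is reachable (replace <server-ip> with your Docker host's address)
nc -vz <server-ip> 10300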

Check my comment Run whisper on external server - #120 by alienatedsec, where the response time is around a second.

Has anyone used a Jetson since HA/Nabu started working with them and moved everything voice-related to GPU/CUDA, among other things? The 8 GB models dropped in price, but I'm really having a hard time finding any feedback, and it's obviously a bit technical to set up. I want to set up HA Core on a Jetson for testing. In the long run I would prefer to (eventually) run everything on one box, and you can run HA Core on the Jetson now.

For those of us using Intel machines with no GPU other than the integrated one (Intel 12th-gen Xe iGPU), what's the fastest way of running Whisper?
The small-int8 model is the first one that kind of works; anything smaller is just crap, and even small is just meh.
A simple sentence like “Turn on the kitchen lights” takes 2.3 s on small-int8, which is on the verge of being usable; 3.3 s on small, which feels too slow; and 6.5 s on medium-int8, which is maddening.

I tried running it inside a dedicated container instead of as an add-on, but I didn't see any noticeable speed improvement (~0.1 s).
Is there some Docker configuration, or an alternative image, that would run faster without an Nvidia or AMD GPU?
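
For reference, the dedicated container I tried was roughly the stock CPU setup (a sketch using the rhasspy/wyoming-whisper image; model and paths are examples):

# sketch: stock CPU-only Wyoming Whisper container, no GPU acceleration
docker run -d --name wyoming-whisper \
  -p 10300:10300 \
  -v /path/to/data:/data \
  rhasspy/wyoming-whisper \
  --model small-int8 --language en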

No, but the good news is a GTX 1660 Ti works, and they are about $100 CAD used. It won't do LLMs, but it's good enough for this.

Yes, but the bad news is that my server is an Intel NUC, so adding a GPU is not an option.
I was hoping external M.2 TPU accelerators like the Hailo-8 or similar boards would become popular enough.

You can always run an x1 PCIe lane to an external enclosure. Depends how badly you want it.

I don't know yet. I also care a lot about power consumption. My server sips ~6-9 W when mostly idle (which is 98% of the time for a home server). I can imagine adding an Nvidia GPU would easily 5x that number.

Again, depends how badly you want it.

Sometimes you have to have a less-than-optimal setup to run bleeding-edge tech.

If you want to wait for power-efficient edge processors that have full support for whatever stack you want to use, that's fine.

You asked specifically whether it was possible, today, without a GPU. I simply gave you the information you requested. The Hailo-8 doesn't seem supported for this use case, but I could be wrong (memory will be the biggest issue).

And I appreciate it. It's a shame that Intel Xe iGPUs are not supported; they are fairly decent, actually. Maybe there will be developments in the future. I've seen some info about a PyTorch extension with HW acceleration for Intel Xe graphics.

Actually, it won't increase energy consumption much, because the GPU is idle most of the time and only works when you're talking to your assistant. The standby power of a GPU is approximately 6-9 W. If you choose a GPU with 12 GB of VRAM, you can run high-quality STT and an LLM locally. The 3060 is the best choice because it is very affordable and has plenty of VRAM.
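
If you want to verify the idle draw yourself once a card is installed, nvidia-smi can report it directly:

# report current board power draw (run while the assistant is idle)
nvidia-smi --query-gpu=name,power.draw --format=csv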


So - I'm about to start my journey into running Whisper with a GPU and experiencing the speed you guys refer to :slight_smile:

First:
I've got a server with an Intel i9, 64 GB memory, and a 2 TB M.2 drive.
It's running Ubuntu 22.04 LTS and already runs a couple of Docker containers.
I purchased a second-hand Nvidia A2000 that should be here in a couple of days.

@baudneo @Fraddles @alienatedsec

What is the latest route to get up and running? I need to get the Nvidia drivers in (@alienatedsec I guess your URL should work) and then the correct Whisper container.

I hope my hardware combined with Ubuntu 22.04 will be OK?

Finally, there is a Whisper model on Hugging Face I want to add, as it is a pretty solid model for Norwegian supporting a wide range of dialects (don't even try to learn Norwegian): NbAiLab/nb-whisper-large · Hugging Face

Can this be used?

The Docker setup is simple as long as you have the host drivers and the Nvidia Container Toolkit installed.
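
A quick way to confirm the host side is ready before touching the Whisper container (the CUDA image tag below is just an example):

# if this prints your GPU table, Docker can see the card
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi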

That model, I don't think, will work. I may be wrong, but when I was setting my stuff up, I could only use the models that rhasspy supplied. I think the best at the time was a medium int8 model.

Idk if there is now the ability to load in whatever models; you'll need to experiment. I did implement the ability to load arbitrary models, but there was an issue. Can't recall what, but it wasn't worth my time at that point, as the supplied model was, and still is, performing well for English.

Your hardware should be more than enough for the models supplied by rhasspy. If you can load other models, you’ll need to experiment
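
If you do experiment, note that faster-whisper needs models in CTranslate2 format, so that Hugging Face model would first have to be converted, roughly like this (a sketch; it assumes your container accepts a local model directory, which not every wyoming-faster-whisper version does):

# convert the HF model to CTranslate2 format with int8 quantization
pip install ctranslate2 transformers[torch]
ct2-transformers-converter --model NbAiLab/nb-whisper-large \
  --output_dir nb-whisper-large-ct2 --quantization int8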

Thanks @baudneo. Is this image recommended (image: lscr.io/linuxserver/faster-whisper:gpu)? And are there now any limitations on which Nvidia version I install, as I see they now support Ubuntu 22.04 (you wrote quite a bit earlier not to use > v11.x)?

I personally use a modified version of @edurenye's repo.

IIRC, the comment about CUDA had something to do with building Piper and onnxruntime-gpu. You can just try it out and see if you get any errors. If you get weird errors, switch container tags to a newer or older version of CUDA/cuDNN.
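
Switching tags usually just means changing the base image line in the Dockerfile, e.g. (the tag shown is illustrative):

# in the Dockerfile, try a different CUDA/cuDNN base tag if you hit runtime errors
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04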

There shouldn't be issues, though; it's fairly straightforward with these containers, as they're built for this purpose and to be user-friendly.