Run whisper on external server

Interesting to see that CUDA does not accelerate Piper significantly. I was about to begin the journey of Piper + CUDA 12 on Docker (WSL2), but you may have changed my mind.

Do you know if Piper keeps the model in VRAM?
Just trying to better understand why the performance difference is so small.

Depends on your CPU: more power = less time. I imagine by now they have all the plumbing figured out, so CUDA may be worth it now.

The easy way today: if you have a GPU, use the original whisper.cpp from ggerganov and its OpenAI-compatible API with the Home Assistant plugin.

I tested it and have adopted it now. No overhead, and it is very fast, really very fast.
Plugin and some instructions: GitHub - neowisard/ha_whisper.cpp_stt: Home Assistant Whisper.cpp API STT integration
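
For anyone taking this route, building whisper.cpp with CUDA and starting its HTTP server looks roughly like the sketch below (model choice and paths are examples; the build flag and binary names have changed across whisper.cpp versions, so check the repo README if these fail):

# clone and build whisper.cpp with CUDA support
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_CUDA=1
cmake --build build -j --config Release

# fetch a model and start the server
./models/download-ggml-model.sh medium
./build/bin/whisper-server -m models/ggml-medium.bin --host 0.0.0.0 --port 8080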

For those who are running AMD hardware, I put together a container that runs faster-whisper with ROCm support. I don’t think this was possible until very recently, so I may be among the earliest to implement it for my setup. I figured sharing it here may help some people get the best performance possible.

wyoming-faster-whisper-rocm
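
Under the hood it boils down to passing the ROCm devices through to the container, roughly like this (a sketch; the image name and flags here are illustrative, see the repo README for the exact invocation):

# sketch: ROCm needs the kernel driver and DRI render devices passed through
docker run -d --name wyoming-faster-whisper-rocm \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  -p 10300:10300 \
  -v /path/to/data:/data \
  wyoming-faster-whisper-rocm \
  --model medium-int8 --language en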


I hope the below helps someone, as it took me ages to find a fully working x86-64 version.
Make sure you have the NVIDIA Container Toolkit and drivers installed.

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
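
On a Debian/Ubuntu host, the steps from that guide boil down to roughly the following (check the linked docs for the current repository setup before copying):

# add NVIDIA's signing key and apt repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# install the toolkit and register it as a Docker runtime
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker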

This docker-compose.yaml worked for me with CUDA acceleration and Wyoming support.

services:
  faster-whisper:
    image: lscr.io/linuxserver/faster-whisper:gpu
    container_name: faster-whisper-cuda-linux
    runtime: nvidia
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Europe/London
      - WHISPER_MODEL=medium-int8
      - WHISPER_BEAM=1 #optional
      - WHISPER_LANG=en #optional
    volumes:
      - /root/.cache/whisper:/config # adjust this path to your setup
    ports:
      - 10300:10300
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2 # I have two CUDA-capable cards
              capabilities:
                - gpu
networks: {}
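
Once the container is up, it's worth a quick sanity check before adding it in Home Assistant via the Wyoming Protocol integration (host = your server IP, port = 10300):

# confirm the container sees the GPUs (run nvidia-smi on the host if it's not in the container)
docker exec faster-whisper-cuda-linux nvidia-smi

# confirm the Wyoming port is reachable (replace <server-ip> with your Docker host's address)
nc -vz <server-ip> 10300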

Check my comment Run whisper on external server - #120 by alienatedsec, where the response time is around a second.

Has anyone used a Jetson since HA/Nabu started working with them and moved everything voice-related to GPU/CUDA, among other things? The 8 GB models dropped in price, but I'm really having a hard time finding any feedback, and it's obviously a bit technical to set up. I want to set up HA Core on a Jetson for testing. In the long run I would prefer to (eventually) run everything on one box, and you can run HA Core on the Jetson now.

For those of us using Intel machines with no GPU other than the integrated one (Intel 12th-gen Xe iGPU), what's the fastest way of running Whisper?
The small-int8 model is the first one that kind of works; anything smaller is just crap, and even small is just meh.
A simple sentence like “Turn on the kitchen lights” takes 2.3 s on small-int8, which is on the verge of being usable; 3.3 s on small, which feels too slow; and 6.5 s on medium-int8, which is maddening.

I tried running it inside a dedicated container instead of as an add-on, but I didn't see any noticeable speed improvement (~0.1 s).
Is there some Docker configuration, or an alternative image, that would run faster without an Nvidia or AMD GPU?
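
For reference, the dedicated container I tried was roughly the stock CPU setup (a sketch using the rhasspy/wyoming-whisper image; model and paths are examples):

# sketch: stock CPU-only Wyoming Whisper container, no GPU acceleration
docker run -d --name wyoming-whisper \
  -p 10300:10300 \
  -v /path/to/data:/data \
  rhasspy/wyoming-whisper \
  --model small-int8 --language en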

No, but the good news is a GTX 1660 Ti works, and they are about $100 CAD used. It won't do LLMs, but it's good enough for this.

Yes, but the bad news is that my server is an Intel NUC, so adding a GPU is not an option.
I was hoping external M.2 TPU accelerators like the Hailo-8 or similar boards would become popular enough.

You can always run an x1 PCIe lane to an external enclosure. Depends how badly you want it.

I don't know yet. I also care a lot about power consumption. My server sips ~6-9 W when mostly idle (which is 98% of the time for a home server). I can imagine adding an Nvidia GPU would easily 5x that number.

Again, depends how badly you want it.

Sometimes you have to have a less-than-optimal setup to run bleeding-edge tech.

If you want to wait for power-efficient edge processors that have full support for whatever stack you want to use, that's fine.

You asked specifically whether it was possible, today, without a GPU. I simply gave you the information you requested. The Hailo-8 doesn't seem supported for this use case, but I could be wrong (memory will be the biggest issue).

And I appreciate it. It's a shame that Intel Xe iGPUs are not supported; they are fairly decent, actually. Maybe there will be developments in the future. I've seen some info about a PyTorch extension with HW acceleration for Intel Xe graphics.

Actually, it won't increase energy consumption much, because the GPU is idle most of the time and only works when you're talking to your assistant. The standby power of a GPU is approximately 6-9 W. If you choose a GPU with 12 GB of VRAM, you can run high-quality STT and an LLM locally. The 3060 is the best choice because it is very affordable and has plenty of VRAM.
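
If you want to verify the idle draw yourself once a card is installed, nvidia-smi can report it directly:

# report current board power draw (run while the assistant is idle)
nvidia-smi --query-gpu=name,power.draw --format=csv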


So - I'm about to start my journey into running Whisper with a GPU and experiencing the speed you guys refer to :slight_smile:

First:
I've got a server with an Intel i9, 64 GB memory, and a 2 TB M.2 drive.
It's running Ubuntu 22.04 LTS and already runs a couple of Docker containers.
I purchased a second-hand Nvidia A2000 that should be here in a couple of days.

@baudneo @Fraddles @alienatedsec

What is the latest route to get up and running? I need to get the Nvidia drivers in (@alienatedsec I guess your URL should work) and then the correct Whisper container.

I hope my hardware combined with Ubuntu 22.04 will be OK?

Finally, there is a Whisper model on Hugging Face I want to add, as it is a pretty solid model for Norwegian supporting a wide range of dialects (don't even try to learn Norwegian): NbAiLab/nb-whisper-large · Hugging Face

Can this be used?

The Docker setup is simple as long as you have the host drivers and the Nvidia Container Toolkit installed.
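
A quick way to confirm the host side is ready before touching the Whisper container (the CUDA image tag below is just an example):

# if this prints your GPU table, Docker can see the card
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi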

That model, I don't think, will work. I may be wrong, but when I was setting my stuff up, I could only use the models that rhasspy supplied. I think the best at the time was a medium int8 model.

Idk if there is now the ability to load in whatever models; you'll need to experiment. I did implement the ability to load arbitrary models, but there was an issue. Can't recall what, but it wasn't worth my time at that point, as the supplied model was, and still is, performing well for English.

Your hardware should be more than enough for the models supplied by rhasspy. If you can load other models, you’ll need to experiment
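
If you do experiment, note that faster-whisper needs models in CTranslate2 format, so that Hugging Face model would first have to be converted, roughly like this (a sketch; it assumes your container accepts a local model directory, which not every wyoming-faster-whisper version does):

# convert the HF model to CTranslate2 format with int8 quantization
pip install ctranslate2 transformers[torch]
ct2-transformers-converter --model NbAiLab/nb-whisper-large \
  --output_dir nb-whisper-large-ct2 --quantization int8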

Thanks @baudneo. Is this image recommended (image: lscr.io/linuxserver/faster-whisper:gpu)? And are there now any limitations on which Nvidia version I install, as I see they now support Ubuntu 22.04 (you wrote quite a bit earlier not to use > v11.x)?

I personally use a modified version of @edurenye's repo.

IIRC, the comment about CUDA had something to do with building Piper and onnxruntime-gpu. You can just try it out and see if you get any errors. If you get weird errors, switch container tags to a newer or older version of CUDA/cuDNN.
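
Switching tags usually just means changing the base image line in the Dockerfile, e.g. (the tag shown is illustrative):

# in the Dockerfile, try a different CUDA/cuDNN base tag if you hit runtime errors
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04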

There shouldn't be issues, though; it's fairly straightforward with these containers, as they're built for this purpose and to be user-friendly.