Run Whisper on an external server

Are you limited to NVIDIA GPUs because of the CUDA requirement?

I would prefer to use an NPU or something less power hungry if possible.

1 Like

CUDA = NVIDIA.

These GPU solutions take VERY little power to process the short sentences we throw at Assist. They operate at P2 for less than a second and then go back to idle mode. We aren't playing video games here. FWIW, you can find used NVIDIA GTX 1070 cards on eBay for $50-80 (I bought one a month ago) and they work GREAT. Pop it in an old tower you're about to chuck out and you're golden. I'm using an old Dell Optiplex from 2011 and it's snappy.

Anyone trying to use truly local STT will need a GPU or will suffer intolerable delays when processing local speech. Inference on voice is highly processor intensive, and GPUs are perfectly suited to exactly this. CPUs just aren't up to the task and likely never will be, at least not with the current CPU architectures. A good voice user experience requires sub-second response times. This is a very hard problem to solve, both in software and in hardware.

It's nice to see someone has cracked the GPU nut with Whisper. I got so frustrated with the acrobatics required to do all of this in HA that I've jumped on the Willow+HA bandwagon and am not looking back. When I say 15 minutes from setting up the Box3 device to HA recognizing complex commands in Willow in less than 300 ms, I'm not kidding. That assumes you have Ubuntu installed. And Willow now allows "Pipeline Chaining", meaning the dreaded "I can't understand that" gets passed to Alexa or Google if you want an answer to some random question, without coding custom intents for days.

All of these voice solutions will RADICALLY improve over the next year. Rome wasn't built in a day, and just remember when (for some of us) HA was at version 0.0x… it has progressed by light years since then. The same thing will happen with voice, but on an accelerated timetable.

4 Likes

Interesting. I am trying to power the device via PoE++ (IEEE 802.3bt), which limits me to 60 W for the entire machine.

Currently I am running Home Assistant bare metal on an N5105 box powered by PoE, which has been great. I have Whisper and such installed, but my CPU usage does not appear to spike over 25% when processing requests (single-threaded?). The processing time for the request and speech response is too long compared to our existing Amazon Echoes, despite using the lightest models.

Ideas on how to make this work are welcome. I know it is a near-impossible goal (for now).

I set up Wyoming faster-whisper, Piper and openWakeWord today, all using GPU acceleration on a GTX 1660 Ti. It's fast; I just need to figure out what models I can use with it. The rhasspy models repo (Releases · rhasspy/models · GitHub) seems to only go up to medium/medium-int8.

At the moment, Piper's Python lib doesn't accept the --cuda arg even though the C++ side of it is implemented. You need to bind mount a custom __main__.py and process.py into the custom-built Piper docker container for Piper to use the GPU. I have the GPU-accelerated Wyoming containers on a separate host from HASS, so remote does work. Add Whisper and Piper using Devices & Services → Whisper / Piper. To add a remote openWakeWord server, go to Devices & Services → Wyoming Protocol and enter the remote IP and port.
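
For reference, the bind mount in the compose file looks roughly like this. The in-container paths are my guess at where the wyoming_piper package lives; locate the installed package inside your built image before copying them:

services:
  piper:
    volumes:
      # hypothetical in-container paths - verify where wyoming_piper is installed in your image first
      - ./custom/__main__.py:/usr/lib/python3/dist-packages/wyoming_piper/__main__.py:ro
      - ./custom/process.py:/usr/lib/python3/dist-packages/wyoming_piper/process.py:ro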

For the person asking about the remote data center setup: you would need a VPN connection, as I don't think there is any auth mechanism for the Wyoming containers, but it is technically possible, and latency would be whatever your ping is plus processing time.

Here is the repo I used to deploy the GPU-accelerated containers (don't forget you need nvidia-container-toolkit installed to pass the GPU through to Docker): GitHub - baudneo/wyoming-addons-gpu at gpu

You should only need to clone the repo, cd into it, make sure you are on the gpu branch, and run the compose file to see if it builds properly. You can add the -d flag to docker compose once you know the containers build correctly:

git clone https://github.com/baudneo/wyoming-addons-gpu
cd wyoming-addons-gpu
git checkout gpu
docker compose -f docker-compose.gpu.yml up

If the container build fails, you need to remove the cached build environments. I had to remove the containers, then the custom-built images, and run docker system prune to clear the cached build layers before I could rebuild. To rebuild, I used docker compose up --build --force-recreate.
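
The sequence I used was roughly this (the image names are placeholders; substitute whatever your custom-built whisper/piper images are called):

docker compose -f docker-compose.gpu.yml down
docker image rm <custom-whisper-image> <custom-piper-image>
docker system prune
docker compose -f docker-compose.gpu.yml up --build --force-recreate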

The Piper --cuda arg should be implemented soon, so the bind mounting of __main__.py and process.py into the Piper container shouldn't be needed in the future, but it is for now.

Edit: I tried using 'large-v2' as the model parameter, and it throws an error:

__main__.py: error: argument --model: invalid choice: 'large-v2' (choose from 'tiny', 'tiny-int8', 'base', 'base-int8', 'small', 'small-int8', 'medium', 'medium-int8')

So it seems medium(-int8) are the best choices ATM. medium-int8 is taking up 908 MB of GPU memory:

Tue Dec 26 18:23:33 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660 Ti     Off | 00000000:3B:00.0 Off |                  N/A |
|  0%   41C    P2              24W / 130W |   1859MiB /  6144MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     81077      C   /usr/bin/zmc                                 70MiB |
|    0   N/A  N/A    193753      C   /opt/zomi/server/venv/bin/python3           474MiB |
|    0   N/A  N/A   2115758      C   python3                                     908MiB |
|    0   N/A  N/A   3022347      C   /usr/bin/zmc                                246MiB |
|    0   N/A  N/A   3022370      C   /usr/bin/zmc                                158MiB |
+---------------------------------------------------------------------------------------+

3 Likes

How did you get the performance numbers in your post? I've set up an external Whisper and I'd like to see how much time is being spent in each step of the pipeline.

@bkprath

Go to Settings → Voice Assistants and click on your faster-whisper Assist pipeline. When the pipeline modal pops up, there will be a three-dot menu button in the upper right-hand corner; click it and select Debug.

It will take you to the debug screen where you can see runs and processing times.


3 Likes

I've set up Piper, Whisper and openWakeWord using your repo as a git submodule (my own docker-compose file extends your docker-compose.gpu.yml) and everything works great. I get almost instantaneous responses with the tiny-int8 model. Using the medium-int8 model, the processing time is around 4 seconds, which is reasonable considering I'm using an NVIDIA Quadro P400.

Thanks a lot for sharing your work!

For a quick experiment I had my Raspi connect to the data center unencrypted; the Whisper processing just took forever, 4 times as long as running locally on the Raspi. Not sure what was going on here, as the connection was fine. Maybe my virtualised server environment is really not well equipped for this kind of processing.

Awesome benchmarking!

Have you tried getting Whisper to run on any of the Jetson platforms?

1 Like

Is your VPS GPU accelerated, or CPU only? Try going into the Assist pipeline debug and seeing how long each step is taking.

If your VPS is GPU accelerated, run watch -n .1 nvidia-smi and watch the GPU memory and GPU utilization % during voice processing to make sure the GPU is handling the calls.

If it's CPU based, then the CPU or the network connection is probably the bottleneck.

No, I don't have any Jetson hardware to test with. It shouldn't be too hard to modify the custom GPU Dockerfiles for Jetson, though.

If I do ever get Jetson hardware or ssh access to one, I will try cooking up a recipe for them.

I am running the Wyoming GPU accel containers on amd64 arch.

1 Like

There is a closed PR in the wyoming-faster-whisper repo to add the ability to load models from Hugging Face. It will allow you to use 'large-v2' and anything else on Hugging Face that is CTranslate2-compatible.

I'll try to whip up a custom repo for now; this will give users some more model choices until more models land in the models repo that faster-whisper pulls from.

1 Like

This is a simple CPU-only VPS. I posted an Assist pipeline debug above.

What I do not understand is why the initial Whisper step is slower than on my Raspi 4 while the other two components are fast. So the network does not seem to be the issue. And surely the CPU is not slower than my Raspi 4?

The proof is in the pudding: the VPS CPU is being throttled.

The two other components aren't as compute intensive. What I would do is open an SSH connection to the VPS and run tmux or screen with a few panes.

Have htop and systemd's journalctl -f running, then do a voice command and watch how the system resources are used. If the CPU gets pegged at 100% for the whole time the speech-to-text is being computed, then you know you need more CPU.
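
Roughly something like this (the hostname and pane layout are just an example):

ssh user@your-vps
tmux                # split into two panes with Ctrl-b %
htop                # pane 1: watch CPU load while you speak a command
journalctl -f       # pane 2: follow the logs during the STT run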

Natural language processing and TTS don't use many resources at all, so they compute fast; it's the ML model crunching STT that needs the power.

Thanks, yes, I think you might be right. Maybe I will do some more in-depth debugging when I find the time; for now I will just put this project and puzzle aside. Cheers!

Hi, great to hear that you got Whisper working with GPU passthrough in an LXC on Proxmox.

I'm trying to do the same (I'm not greatly experienced with Proxmox, but I have been running HA as a VM for half a year now), but the part I'm struggling with is the GPU passthrough. I've understood that I need to prepare an LXC with the NVIDIA Container Toolkit to allow passthrough, but I keep getting error messages in the console of my LXC. Do you have a guide, or which commands did you use to get the GPU passed through to the LXC?

Any help would be greatly appreciated :grinning:

1 Like

Hi :slight_smile:

So I already had GPU passthrough working in Proxmox, at least the method where you use LXCs rather than VMs. I have my GPU passed through to a Plex LXC, an Ubuntu LXC running hashcat, and another LXC running an LLM. If you already have the GPU successfully passed through and working in another LXC, that is good, because I won't cover how to do that here. You can generally use the guide here, with some changes depending on your particular situation.

To get it to work with faster-whisper, I obviously had to install the same version of the NVIDIA graphics driver in the LXC as I had installed on the host, and make sure I had the correct additional configs in the LXC conf file on the host.
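
For reference, the extra lines in the LXC conf on the Proxmox host look roughly like this. The device major numbers and which /dev/nvidia* nodes exist vary between systems, so treat this as a sketch rather than copy/paste:

# /etc/pve/lxc/<CTID>.conf (container ID is a placeholder; check your device numbers with ls -l /dev/nvidia*)
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 509:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file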

Then I had to install CUDA toolkit version 11.8 after installing the keyring; I got all of this from the NVIDIA CUDA Toolkit archive - here.

I wish I had taken more notes, because I'm pretty sure I also had to install libcudnn8 and libcublas (my libcublas is 12-1, not the same version as CUDA, but it works anyway) before it actually started working.

Then I git-cloned the faster-whisper repo and ran the setup. To run it with the GPU I used --device cuda and all was good. It works very well, aside from the occasional crash due to running out of VRAM, probably because I'm running too large an LLM model.

I also created a .service file so it is easier to manage.

Hope this helps!

Thanks very much for the reply. I've finally found (after hours of searching) a tutorial video that helped me get the NVIDIA drivers installed on both the host and the LXC, and I can see that it's been passed through (confirmed with the nvidia-smi command in the LXC console). For anyone else's reference (using Proxmox with LXCs), I used this video up to the point of installing the Plex server.

So my next question: can you please explain in a bit more detail how to install that CUDA toolkit version? Do I use the standard wget + download link in the LXC console to install it?

Then I'm not sure I understand the next sentence regarding libcublas. Can you elaborate?

Once I've understood those two, I'm sure I can get the compose file up and running, but I would love it if you could paste a copy of yours here.

And I'm sorry if any of the above are dumb questions; the whole Linux thing is quite a new world for me, so I'm still learning at the mo :slight_smile:

No worries!

So to install the CUDA toolkit, you will need to run the following in the terminal of your LXC (assuming you are root):

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb

dpkg -i cuda-keyring_1.1-1_all.deb

apt update

That downloads the keyring from nvidia, installs it, and syncs your package db.

Then install the toolkit using

apt install cuda-toolkit-11-8

I would reboot the container after this.

Then, as I mentioned, I think I also had to install libcublas and libcudnn8 to get faster-whisper to work with my GPU. Can't quite recall though, so YMMV.

apt install libcublas-12-1 libcudnn8

For running faster-whisper, I created a regular user and entered that account before proceeding.

You mentioned a compose file; however, we are not using Docker here, so there is no docker-compose in this case. Instead we are doing it manually, but it is not that difficult :slight_smile:

I git-cloned the faster-whisper repo I mentioned above and ran the installer according to the instructions in the repo for the local install.
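
If it helps, the local install boils down to something like this (check the repo's README for the exact steps in the current release, as the script names below are from memory):

git clone https://github.com/rhasspy/wyoming-faster-whisper.git
cd wyoming-faster-whisper
script/setup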

That is about it for installation! The repo also shows you how to download the model you need and gives a sample command to run Whisper, but to use the GPU you need to add the --device flag. Here is mine:

script/run --model medium-int8 --compute-type int8 --language en --device cuda --uri 'tcp://0.0.0.0:10300' --data-dir ~/models

You might want to add --debug at the end so you can see more info in the terminal in case something goes wrong.

You will need to point your HA to the Whisper server, and you do that by adding another Whisper integration in HA. In the 'Host' field put the IP address of the LXC container; the port is 10300.

Once you have it working, it's a good idea to make a .service file so you can manage the server more easily, and so it runs in the background. Create the file with the following command:

sudo systemctl edit --force --full whisper.service

Then fill it in according to my example below. Yours will differ slightly based on your username, the location of the whisper script, and whatever options you choose as flags:

[Unit]
Description=Faster Whisper
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/home/<username>/wyoming-faster-whisper/script/run --model medium-int8 --compute-type int8 --beam-size 2 --language en --device cuda --uri 'tcp://0.0.0.0:10300' --data-dir /home/whisper/models
WorkingDirectory=/home/<username>/wyoming-faster-whisper
Restart=always
RestartSec=1

[Install]
WantedBy=default.target

After creating that file you need to run sudo systemctl daemon-reload. Then running sudo systemctl enable --now whisper.service will start Whisper and enable it to start on boot.

Hope I didn't miss anything!

After testing, Piper is not GPU accelerated because the --use-cuda flag is not in any of the current releases. The only way to get GPU-accelerated Piper ATM is to build it. I created a fork that handles all of that for the end user.

All you should need to do is:

git clone https://github.com/baudneo/wyoming-addons-gpu.git -b build_piper
cd wyoming-addons-gpu
docker compose -f docker-compose.gpu.yml up -d
# Check logs
docker compose -f docker-compose.gpu.yml logs -f

GPU mem before

Tue Dec 26 18:23:33 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660 Ti     Off | 00000000:3B:00.0 Off |                  N/A |
|  0%   41C    P2              24W / 130W |   1859MiB /  6144MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     81077      C   /usr/bin/zmc                                 70MiB |
|    0   N/A  N/A    193753      C   /opt/zomi/server/venv/bin/python3           474MiB |
|    0   N/A  N/A   2115758      C   python3                                     908MiB |
|    0   N/A  N/A   3022347      C   /usr/bin/zmc                                246MiB |
|    0   N/A  N/A   3022370      C   /usr/bin/zmc                                158MiB |
+---------------------------------------------------------------------------------------+

GPU mem after

Sun Jan  7 17:29:23 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660 Ti     Off | 00000000:3B:00.0 Off |                  N/A |
| 23%   42C    P2              24W / 130W |   2291MiB /  6144MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    193753      C   /opt/zomi/server/venv/bin/python3           474MiB |
|    0   N/A  N/A    720284      C   /usr/bin/zmc                                246MiB |
|    0   N/A  N/A    720304      C   /usr/bin/zmc                                158MiB |
|    0   N/A  N/A    720344      C   /usr/bin/zmc                                 70MiB |
|    0   N/A  N/A   1507862      C   python3                                    1340MiB |
+---------------------------------------------------------------------------------------+

So, 908 MB before with just Whisper 'medium-int8' and 1340 MB after with both Piper and Whisper 'medium-int8' loaded.