Are you limited to NVIDIA GPUs because of the CUDA requirement?
I would prefer to use an NPU or something less power hungry if possible.
CUDA = NVIDIA.
These GPU solutions take VERY little power to process the short sentences we throw at Assist. They operate at P2 for less than a second and then go back to idle mode. We aren't playing video games here. FWIW, you can find used NVIDIA GTX 1070 cards on eBay for $50-80 (I bought one a month ago) and they work GREAT. Pop it in an old tower you're about to chuck out and you're golden. I'm using an old Dell Optiplex from 2011 and it's snappy.
Anyone trying to use truly local STT will be required to use a GPU or suffer intolerable delays when processing local speech. Inference on voice patterns is highly processor intensive, and GPUs are perfectly suited for exactly this. CPUs just aren't up to the task and likely never will be, at least not with any of the current CPU architectures. A good voice user experience requires sub-second response times. This is a very hard problem to solve, both in software and in hardware.
It's nice to see someone has cracked the GPU nut with Whisper. I got so frustrated with the acrobatics required to do all of this in HA that I've jumped on the Willow+HA bandwagon and am not looking back. When I say 15 minutes from setting up the Box3 device to HA recognizing complex commands in Willow in less than 300ms, I'm not kidding. That assumes you have Ubuntu installed. And Willow now allows "Pipeline Chaining", meaning the dreaded "I can't understand that" gets passed to Alexa or Google if you want an answer to some random question without coding custom intents for days.
All of these voice solutions will RADICALLY improve over the next year. Rome wasn't built in a day, and just remember when (for some of us) HA was at version 0.0x… it's progressed by light years since then. The same thing will happen with voice, but on an accelerated timetable.
Interesting. I am trying to power the device via PoE++ (IEEE 802.3bt), which limits me to 60W for the entire machine.
Currently I am running Home Assistant bare metal on an N5105 box powered by PoE, which has been great. I have Whisper and such installed, but my CPU usage does not appear to spike over 25% when processing requests (single-threaded?). The processing time for the request and speech response is too long compared to our existing Amazon Echoes, despite using the lightest models.
Ideas on how to create something are welcome. I know it is a near impossible goal (for now).
I set up Wyoming faster-whisper, piper and openwakeword today, all using GPU accel on a GTX 1660 Ti. It's fast, I just need to figure out what models I can use with it. The rhasspy models repo (Releases · rhasspy/models · GitHub) seems to only go up to medium/medium-int8.
At the moment, piper's Python lib doesn't accept the --cuda arg even though the C++ side of it is implemented. You need to bind mount a custom __main__.py and process.py into the custom-built piper docker container for piper to use the GPU. I have the GPU-accelerated Wyoming containers on a separate host from HASS, so remote does work. Add whisper and piper using the Devices & Services → Whisper / Piper pipeline. To add a remote openwakeword server, go to Devices & Services → Wyoming Protocol and enter the remote IP and PORT.
For the person asking about the remote data center stuff, you would need a VPN connection as I don't think there are any auth mechanisms for the Wyoming containers, but it is technically possible and latency would be whatever your ping is + processing time.
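If a full VPN is more than you want, an SSH tunnel is one lightweight alternative (just a sketch, not something tested with this setup; the user and host below are placeholders):

# Forward the remote whisper port (10300 by default) to localhost over SSH
ssh -N -L 10300:127.0.0.1:10300 user@remote-datacenter-host

Run the tunnel on the machine hosting HA, then point the Wyoming/Whisper integration at 127.0.0.1, port 10300.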
Here is my repo that I used to deploy the GPU-accelerated containers (don't forget you need nvidia-container-toolkit installed to pass the GPU to docker): GitHub - baudneo/wyoming-addons-gpu at gpu
You should only need to clone the repo (git clone https://github.com/baudneo/wyoming-addons-gpu), cd into wyoming-addons-gpu, make sure you are using the gpu branch (git checkout gpu), and then run docker compose -f docker-compose.gpu.yml up and see if it builds properly. You can add the -d flag to docker compose once you know the containers build properly:
git clone https://github.com/baudneo/wyoming-addons-gpu
cd wyoming-addons-gpu
git checkout gpu
docker compose -f docker-compose.gpu.yml up
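If the containers can't see the GPU, first confirm that the nvidia-container-toolkit is actually wired into Docker; a quick check (the CUDA image tag here is just an example) is:

# If Docker can see the GPU, this prints the same nvidia-smi table as on the host
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi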
If the container builds fail, you need to remove the cached build envs. I had to remove the containers, then the custom-built images, and run docker system prune to remove the cached build environments before I could rebuild the containers when there were issues. To rebuild, I was using docker compose up --build --force-recreate.
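A rough sketch of that cleanup sequence (the exact image names compose built on your machine will differ, so adjust accordingly):

# Stop the stack and remove the locally built images, then prune cached build layers
docker compose -f docker-compose.gpu.yml down --rmi local
docker system prune
# Rebuild from a clean slate
docker compose -f docker-compose.gpu.yml up --build --force-recreate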
The piper --cuda arg should be implemented soon, so the bind mounting of __main__.py and process.py into the piper container shouldn't be needed in the future, but it is for now.
Edit: I tried using "large-v2" as the model parameter, and it throws an error:
__main__.py: error: argument --model: invalid choice: 'large-v2' (choose from 'tiny', 'tiny-int8', 'base', 'base-int8', 'small', 'small-int8', 'medium', 'medium-int8')
So it seems medium(-int8) are the best choices ATM. medium-int8 is taking up 908MB of GPU memory:
Tue Dec 26 18:23:33 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1660 Ti Off | 00000000:3B:00.0 Off | N/A |
| 0% 41C P2 24W / 130W | 1859MiB / 6144MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 81077 C /usr/bin/zmc 70MiB |
| 0 N/A N/A 193753 C /opt/zomi/server/venv/bin/python3 474MiB |
| 0 N/A N/A 2115758 C python3 908MiB |
| 0 N/A N/A 3022347 C /usr/bin/zmc 246MiB |
| 0 N/A N/A 3022370 C /usr/bin/zmc 158MiB |
+---------------------------------------------------------------------------------------+
How did you get the performance numbers in your post? I've set up an external whisper and I'd like to see how much time is being spent in each step of the pipeline.
Go to Settings → Voice Assistants and click on your faster-whisper assist pipeline. When the pipeline modal pops up, there will be a 3-dot menu button in the upper right hand corner; click it and select Debug.
It will take you to the debug screen where you can see runs and processing times.
I've set up piper, whisper and openwakeword using your repo as a git submodule (my own docker-compose file extends your docker-compose.gpu.yml) and everything works great. I get almost instantaneous responses with the tiny-int8 model. Using the medium-int8 model, the processing time is around 4 seconds, which is reasonable considering I'm using an Nvidia Quadro P400.
Thanks a lot for sharing your work!
For a quick experiment I had my Raspi connect to the data center unencrypted, and the whisper processing just took forever, 4 times as long as running locally on the Raspi. Not sure what was going on here; the connection was fine. Maybe my virtualised server environment is really not well equipped for this kind of processing.
Is your VPS GPU accelerated, or CPU only? Try going into the assist pipeline debug and seeing how long each step is taking.
If your VPS is GPU accelerated, run watch -n .1 nvidia-smi and watch the GPU memory and GPU utilization % during voice processing to make sure the GPU is handling the calls.
If it's CPU based, then the CPU or network connection is probably the bottleneck.
No, I don't have any Jetson hardware to test with. It shouldn't be too hard to modify the custom GPU Dockerfiles for Jetson though.
If I do ever get Jetson hardware or SSH access to one, I will try cooking up a recipe for them.
I am running the Wyoming GPU accel containers on amd64 arch.
There is a closed PR in the wyoming-faster-whisper repo to add the ability to load models from Hugging Face. It will allow you to use "large-v2" and anything else on Hugging Face that is compatible (CTranslate2 compatible).
I'll try and whip up a custom repo for now; this will give users some more model choices until more models land in the models repo that faster-whisper pulls from.
This is a simple CPU-accelerated VPS. I have posted an assist pipeline debug above.
What I do not understand is how the initial whisper time is slower than on my Raspi 4 while the other two components are fast. So the network does not seem to be an issue. And surely the CPU is not slower than my Raspi 4?
Proof is in the pudding: the VPS CPU is being throttled.
The 2 other components aren't as compute intensive. What I would do is open an SSH connection to the VPS and run tmux or screen with a few panes.
Have htop and systemd's journalctl -f running, then do a voice command and watch how the system resources are being used. If the CPU gets pegged at 100% for the whole time the speech-to-text is being computed, then you know you need more CPU.
Natural language processing and TTS don't use many resources at all, so they will compute fast; it's the ML model crunching STT that needs the power.
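Something like this, as a rough sketch (assuming tmux and htop are installed on the VPS):

# SSH in and start a tmux session; split panes with Ctrl-b % (vertical) or Ctrl-b " (horizontal)
tmux new -s assist-debug
htop            # pane 1: is the CPU pegged at 100% while STT runs?
journalctl -f   # pane 2: follow system logs while the pipeline runs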
Thanks, yes, I think you might be right. Maybe I will do some more in-depth debugging when I find the time; for now I will just put this project and puzzle aside. Cheers!
Hi, great to hear that you got Whisper working with GPU passthrough in an LXC on Proxmox.
I'm trying to do the same (I'm not greatly experienced with Proxmox, but have been running HA as a VM for half a year now), but the part I'm struggling with is the GPU passthrough. I've understood that I need to prepare an LXC with the nvidia container toolkit to allow passthrough, but I keep receiving error messages in the console on my LXC. Do you have a guide, or which commands did you use to get the GPU passed through to the LXC?
Any help would be greatly appreciated
Hi
So I already had GPU passthrough working in Proxmox, at least the method where you can use LXCs rather than VMs. I have my GPU passed through to a Plex LXC, an Ubuntu LXC running hashcat, as well as another LXC running an LLM. If you already have the GPU successfully passed through and working in another LXC, that is good, because I won't cover how to do that here. You can generally use the guide here, with some changes depending on your particular situation.
To get it to work with faster-whisper, I obviously had to install the same version of the Nvidia graphics driver on the LXC as I had installed on the host, and make sure I had the correct additional configs in the LXC conf file on the host.
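(For orientation only, since the guide linked above covers passthrough properly: the host-side conf entries for an NVIDIA card usually look roughly like the lines below. The device major numbers vary from system to system, so check ls -l /dev/nvidia* on the host and adjust.)

# /etc/pve/lxc/<CTID>.conf -- typical NVIDIA passthrough lines; <CTID> is your container ID
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 509:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file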
Then I had to install CUDA toolkit version 11.8 after installing the keyring; I got all this from the Nvidia CUDA toolkit archive - here.
I wish I had taken more notes, because I'm pretty sure I also had to install libcudnn8 and libcublas (my libcublas is 12-1, not the same version as CUDA, but it works anyway) before it actually started working.
Then I git-cloned the faster-whisper repo and ran the setup. To run it with the GPU I used --device cuda and all was good. It works very well aside from the occasional crash due to being out of VRAM, probably due to me running too large an LLM model.
I also created a .service file so it is easier to manage.
Hope this helps!
Thanks very much for the reply. I've finally found (after hours of searching) a tutorial video that helped me get the Nvidia drivers installed on both the host and the LXC, and I can see that it's been passed through (confirmed with the nvidia-smi command in the LXC console). For anyone else's reference (using Proxmox with LXCs), I used this video up to the point of installing the Plex server.
So my next question: can you please detail a bit more about how to install the CUDA toolkit version? Do I use the standard wget + download link in the LXC console to install it?
Then I'm not sure if I understand the next sentence regarding libcublas. Can you elaborate?
Once I've understood those two, I'm sure I can get the compose file up and running, but I would love it if you could paste a copy of yours here.
And I'm sorry if any of the above are dumb questions; the whole Linux thing is quite a new world for me, so I'm still learning at the mo
No worries!
So to install the CUDA toolkit, you will need to run the following in the terminal of your LXC (assuming you are root):
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt update
That downloads the keyring from nvidia, installs it, and syncs your package db.
Then install the toolkit using
apt install cuda-toolkit-11-8
I would reboot the container after this.
Then, as I mentioned, I think I also had to install libcublas and libcudnn8 to get faster-whisper to work with my GPU. Can't quite recall though, so YMMV.
apt install libcublas-12-1 libcudnn8
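If you want to sanity-check that the libraries are actually visible inside the LXC before moving on (just a suggestion, not part of the original steps):

# Both libraries should show up in the linker cache, and nvidia-smi should still see the GPU
ldconfig -p | grep -Ei 'libcublas|libcudnn'
nvidia-smi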
For running faster-whisper, I created a regular user and entered that account before proceeding.
You mentioned a compose file; however, we are not using Docker here, so there is no docker-compose in this case. Instead we are doing it manually, but it is not that difficult.
I git-cloned the faster-whisper repo I mentioned above and ran the installer according to the instructions in the repo for the local install.
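Roughly, that local install looks like this (the script names follow the repo's conventions, so double-check its README):

# Clone the Wyoming faster-whisper repo and run its local setup script
git clone https://github.com/rhasspy/wyoming-faster-whisper.git
cd wyoming-faster-whisper
script/setup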
That is about it for installation! The repo also shows you how to download the model you need and a sample command to run whisper. But to use the GPU you need to add the --device flag. Here is mine:
script/run --model medium-int8 --compute-type int8 --language en --device cuda --uri 'tcp://0.0.0.0:10300' --data-dir ~/models
You might want to add --debug on the end so you can see more info in the terminal in case something goes wrong.
You will need to point your HA to the whisper server, and you do that by adding another Whisper integration in HA. In the "Host" field put the IP address of the LXC container; the port is 10300.
Once you have it working, it's a good idea to make a .service file so you can manage the server more easily, and so it runs in the background. Create the file with the following command:
sudo systemctl edit --force --full whisper.service
Then create it according to my example below. Yours will differ slightly based on your username, the location of the whisper script, and whatever options you choose as flags:
[Unit]
Description=Faster Whisper
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
ExecStart=/home/<username>/wyoming-faster-whisper/script/run --model medium-int8 --compute-type int8 --beam-size 2 --language en --device cuda --uri 'tcp://0.0.0.0:10300' --data-dir /home/whisper/models
WorkingDirectory=/home/<username>/wyoming-faster-whisper
Restart=always
RestartSec=1
[Install]
WantedBy=default.target
After creating that file you need to run sudo systemctl daemon-reload. Then running sudo systemctl enable --now whisper.service will start whisper and enable it to start on boot.
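To confirm it came up cleanly (optional, but handy):

# Check the service status and follow its logs
sudo systemctl status whisper.service
journalctl -u whisper.service -f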
Hope I didn't miss anything!
After testing, piper is not GPU accelerated because the --use-cuda flag is not in any of the current releases. The only way to get GPU accelerated piper ATM is to build it. I created a fork that handles all of that for the end user.
All you should need to do is:
git clone https://github.com/baudneo/wyoming-addons-gpu.git -b build_piper
cd wyoming-addons-gpu
docker compose -f docker-compose.gpu.yml up -d
# Check logs
docker compose -f docker-compose.gpu.yml logs -f
Tue Dec 26 18:23:33 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1660 Ti Off | 00000000:3B:00.0 Off | N/A |
| 0% 41C P2 24W / 130W | 1859MiB / 6144MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 81077 C /usr/bin/zmc 70MiB |
| 0 N/A N/A 193753 C /opt/zomi/server/venv/bin/python3 474MiB |
| 0 N/A N/A 2115758 C python3 908MiB |
| 0 N/A N/A 3022347 C /usr/bin/zmc 246MiB |
| 0 N/A N/A 3022370 C /usr/bin/zmc 158MiB |
+---------------------------------------------------------------------------------------+
Sun Jan 7 17:29:23 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1660 Ti Off | 00000000:3B:00.0 Off | N/A |
| 23% 42C P2 24W / 130W | 2291MiB / 6144MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 193753 C /opt/zomi/server/venv/bin/python3 474MiB |
| 0 N/A N/A 720284 C /usr/bin/zmc 246MiB |
| 0 N/A N/A 720304 C /usr/bin/zmc 158MiB |
| 0 N/A N/A 720344 C /usr/bin/zmc 70MiB |
| 0 N/A N/A 1507862 C python3 1340MiB |
+---------------------------------------------------------------------------------------+
So, 908MB before with just whisper "medium-int8" loaded, and 1340MB after with both piper and whisper "medium-int8" loaded.