Whisper on GPU

Intro

I’m running Whisper with (Nvidia) GPU support for local speech-to-text (STT) recognition. It was surprisingly difficult to find all the required information in one place, so I thought I’d share my results.

To be precise, I’m using the Faster Whisper implementation. There is a linuxserver/faster-whisper docker image, which adds the Wyoming protocol for HA along with GPU support.

I’ve kept my German configuration example as I think this is mostly interesting for non-English setups. Should be easy to adjust though.

Setup

I’m running it externally on a basic Ubuntu server.

Setup: Prerequisites

You’ll first need to install Docker (incl. Docker Compose), if you haven’t already.
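If Docker is still missing, one option (an assumption on my side — any supported install method works just as well) is the official convenience script:

```shell
# Install Docker Engine including the compose plugin via the official script
# (review get-docker.sh first if piping scripts to sh makes you uneasy).
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
docker compose version   # verify the compose plugin is available
```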

A less common requirement is the Nvidia Container Toolkit, which must be installed on your host. Example instructions for Ubuntu:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update

export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}

sudo nvidia-ctk runtime configure --runtime=docker

sudo systemctl restart docker
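To verify the runtime is wired up correctly, a sample workload along the lines of NVIDIA’s own docs can be run — a throwaway container calling nvidia-smi (the plain ubuntu image suffices because the nvidia runtime injects the driver utilities):

```shell
# Sanity check: if this prints your GPU table, Docker can see the card.
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```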

Setup: Docker Compose Service

Example setup for /opt/faster-whisper:

cd /opt
sudo mkdir faster-whisper
sudo chown server:server faster-whisper
cd faster-whisper
vim docker-compose.yml

docker-compose.yml

services:
  faster-whisper:
    image: lscr.io/linuxserver/faster-whisper:gpu
    container_name: faster-whisper
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Europe/Berlin
      - WHISPER_MODEL=large-v3
      - WHISPER_LANG=de
      - WHISPER_BEAM=20
      - LOG_LEVEL=DEBUG
    volumes:
      - ./faster-whisper/data:/config
      - ./faster-whisper/run:/etc/s6-overlay/s6-rc.d/svc-whisper/run
    ports:
      - 10300:10300
    restart: unless-stopped
    network_mode: host
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu
                - utility
                - compute

Setup: Override initial-prompt (optional)

Unfortunately, the docker image does not expose the --initial-prompt option yet. So I simply duplicated the file containing the relevant command in order to bind-mount a modified version:

sudo vim /opt/faster-whisper/faster-whisper/run

faster-whisper/run

#!/command/with-contenv bash
# shellcheck shell=bash

export LD_LIBRARY_PATH=$(python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__path__[0]) + "/lib:" + os.path.dirname(nvidia.cudnn.lib.__path__[0]) + "/lib")')

exec \
    s6-notifyoncheck -d -n 300 -w 1000 -c "nc -z localhost 10300" \
        s6-setuidgid abc python3 -m wyoming_faster_whisper \
        --uri 'tcp://0.0.0.0:10300' \
        --device cuda \
        --model "${WHISPER_MODEL}" \
        --beam-size "${WHISPER_BEAM:-1}" \
        --language "${WHISPER_LANG:-en}" \
        --data-dir /config \
        --download-dir /config \
        --initial-prompt "Du sollst primär Sprachbefehle für unser Smart Home erkennen. Alle Sätze sind Befehle oder Fragen. Entferne jegliche Hinweise auf Untertitel. Ein Satz fängt selten mit Ich an."

(The German prompt roughly translates to: “You should primarily recognize voice commands for our smart home. All sentences are commands or questions. Remove any references to subtitles. A sentence rarely starts with I.”)

Setup: Final Steps

Run:

docker compose up [-d]

For use with Home Assistant Assist, add the Wyoming integration and supply the hostname/IP and port that Whisper is running on.
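Before adding the integration, you can quickly check that the Wyoming port is reachable (this uses bash’s /dev/tcp redirection; swap localhost for your server’s IP when checking from another machine):

```shell
# Exits 0 and prints the message if something is listening on port 10300.
timeout 2 bash -c '</dev/tcp/localhost/10300' && echo "faster-whisper is listening"
```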

Tuning

Tuning: Beam Size

Try different values for WHISPER_BEAM. The original Whisper implementation defaults to a beam size of 1, Faster Whisper to 5. Higher values should result in higher accuracy, but also in higher VRAM usage.

I’m currently using quite a high value of 20, because at least for German, accuracy still seems to be a more pressing problem than speed.

Tuning: Model

You can adjust WHISPER_MODEL to any of the predefined models:

_MODELS = {
    "tiny.en": "Systran/faster-whisper-tiny.en",
    "tiny": "Systran/faster-whisper-tiny",
    "base.en": "Systran/faster-whisper-base.en",
    "base": "Systran/faster-whisper-base",
    "small.en": "Systran/faster-whisper-small.en",
    "small": "Systran/faster-whisper-small",
    "medium.en": "Systran/faster-whisper-medium.en",
    "medium": "Systran/faster-whisper-medium",
    "large-v1": "Systran/faster-whisper-large-v1",
    "large-v2": "Systran/faster-whisper-large-v2",
    "large-v3": "Systran/faster-whisper-large-v3",
    "large": "Systran/faster-whisper-large-v3",
    "distil-large-v2": "Systran/faster-distil-whisper-large-v2",
    "distil-medium.en": "Systran/faster-distil-whisper-medium.en",
    "distil-small.en": "Systran/faster-distil-whisper-small.en",
    "distil-large-v3": "Systran/faster-distil-whisper-large-v3",
    "distil-large-v3.5": "distil-whisper/distil-large-v3.5-ct2",
    "large-v3-turbo": "mobiuslabsgmbh/faster-whisper-large-v3-turbo",
    "turbo": "mobiuslabsgmbh/faster-whisper-large-v3-turbo",
}

Alternatively, you can use any CTranslate2-compatible model. This link gives you a filtered list of supported models on Hugging Face. Just use the model name from the website (e.g. guillaumekln/faster-whisper-large-v2) as the value of WHISPER_MODEL and Faster Whisper will automatically take care of downloading it.
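As a concrete sketch (the sed pattern assumes the docker-compose.yml from above, and the model id is just an example), switching models is a one-line environment change plus a container recreate; the model is downloaded into the mounted /config volume on first start:

```shell
cd /opt/faster-whisper
# Point WHISPER_MODEL at a CTranslate2-compatible Hugging Face repo id ...
sed -i 's|WHISPER_MODEL=.*|WHISPER_MODEL=guillaumekln/faster-whisper-large-v2|' docker-compose.yml
# ... and recreate the container so the new value takes effect.
docker compose up -d --force-recreate
```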

Again, I prefer accuracy, so I’m using the “best” default model, large-v3. I’ve also tried quite a few others, including some optimized for German, but couldn’t find a better one.

Tuning: Initial Prompt

You may want to optimize the --initial-prompt setting mentioned earlier. I haven’t put much thought into this yet, though.

Performance

On my RTX 3090, local speech-to-text recognition now takes about half a second. The quality is the best I could get so far, though to be honest it’s still merely acceptable rather than great. I’m looking forward to feedback on how to optimize further.


Rhasspy has a docker image as well.

They also have STT and TTS, so I use them individually for my local assistant.

Rhasspy also offers an open-source voice assistant that integrates with HA. This was added a while back, but I didn’t remember it until I tried to find the link.

Thanks for the suggestion. That’s kind of “the official” HA way to go, isn’t it? The problem is that this version does not have the required GPU support yet (see e.g. this PR).

On my CPU, I can only run very small models. Unfortunately, those are absolutely useless for German.
