Help in finding a reliable STT for Italian

Hi all,

I’m struggling to find a reliable way to cut the cord from my Google Home Minis.

I’ve spent time and money testing several platforms, but in the end the experience is still poor.

I mainly use an HA Voice PE for my tests, so at least the mic should be good.

For STT I tested:

  1. Whisper = slow and poor recognition
  2. Vosk = better experience but still not enough

I then moved the pipeline from my Celeron NUC to my Xeon media server (a Dell T20) and tried the two STT engines above again, but recognition was still poor.

I then added an Nvidia GeForce GTX 1660 6GB and used whisper:gpu, but recognition time is still high and quality is poor.

Lately I found WhisperX and other containers, but my knowledge is limited, so I’m not able to build a container from scratch.

Is there some good samaritan who can support me in this journey? I don’t want to lose this fight.

Thanks!

How long?
The turbo model on the 1060 gives an average latency of 1 second.

Hmm, maybe it’s not using the GPU at all.

I checked the logs and found this

INFO:faster_whisper:Processing audio with duration 00:02.820
INFO:wyoming_faster_whisper.handler:!!!

check the device load in nvtop


seems the GPU is used…

Give me an example of the phrase used here. I will try to make measurements on my hardware.

I used a simple ACCENDI SCRIVANIA, which means “turn on scrivania” (a light with that name).
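To make those measurements comparable across setups, a tiny stdlib timer is enough. The `timed` helper below is generic; the lambda standing in for a real STT call is just an illustration, not an actual recognizer:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example with a dummy stand-in for a real STT call:
text, seconds = timed(lambda phrase: phrase.lower(), "ACCENDI SCRIVANIA")
print(f"{text!r} took {seconds:.3f}s")
```

Wrapping whatever transcription call you test in `timed(...)` gives latency numbers you can compare directly between machines.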

The second gpu is not involved in the recognition.

Right now I’m using the container from slackr31337. But the service from the original repository runs at an identical speed.

Can you share the details of your docker run? Just to see if there’s any difference

Pretty standard configuration

services:
  wyoming-whisper:
    image: slackr31337/wyoming-whisper-gpu:latest
    container_name: wyoming-whisper
    environment:
      - MODEL=turbo
      - LANGUAGE=ru
      - COMPUTE_TYPE=int8
      - BEAM_SIZE=5
    ports:
      - 10300:10300
    volumes:
      - /home/mchk/data:/data
    restart: unless-stopped
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities:
                - gpu
                - utility
                - compute

And also
NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8
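One detail in that compose worth experimenting with: `COMPUTE_TYPE=int8` trades some accuracy for speed, and quantization can cost recognition quality. A float16 variant may be worth comparing; this is a sketch assuming the image honors the same environment variables shown in the compose above, with the language forced to Italian for the original poster’s case:

```yaml
    environment:
      - MODEL=large-v3        # or turbo; larger models are usually more accurate
      - LANGUAGE=it           # force Italian instead of auto-detect
      - COMPUTE_TYPE=float16  # int8 quantization can reduce accuracy on GPU
      - BEAM_SIZE=5
```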

If nothing can be fixed, use cloud integration. Groq can be used for free up to a certain limit.

It’s still imprecise.

Regarding the cloud solution: well, I spent money on a GPU precisely to keep everything at home…

Have you installed pytorch?

I get great results with whisper-large-v3-turbo, but you may not have enough VRAM. The big Whisper versions are very good, and also a bit better with Italian than with English.

No, but I tried large-v3 and the turbo model, and performance is still worse than Vosk.

The installation method does not affect the quality of speech recognition. Whether you run a local service, Docker, or a cloud provider, Whisper (with the same model) will produce identical results.
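One way to test that claim directly is to call faster-whisper (the engine behind wyoming-faster-whisper) on a recorded clip, bypassing the container entirely. A minimal sketch, assuming faster-whisper is installed and a `clip.wav` with the test phrase exists; the import guard keeps it runnable without the library:

```python
try:
    from faster_whisper import WhisperModel  # pip install faster-whisper
except ImportError:
    WhisperModel = None  # library not installed; the sketch degrades gracefully

def transcribe_italian(path, model_name="large-v3"):
    """Transcribe one clip in Italian; returns None if faster-whisper is missing."""
    if WhisperModel is None:
        return None
    model = WhisperModel(model_name, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(path, language="it", beam_size=5)
    return " ".join(seg.text.strip() for seg in segments)

# usage: transcribe_italian("clip.wav")
```

If the text this returns matches what the container produces for the same clip, the problem is not the installation; if it is noticeably better, the container configuration is the suspect.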

Your problem can be divided into two components.
The increased processing time is probably related to the installation method; it is difficult to give advice there.

As for the recognition quality, you can check this by temporarily using the cloud integration with an identical model. This way you can find out whether there is a problem with Whisper in general, or only with the local installation…

I doubt you are running on the graphics card. The only way I know is to use PyTorch with CUDA.
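Worth noting: faster-whisper actually runs on CTranslate2, not PyTorch, so PyTorch is not strictly required. Still, a quick check like this tells you whether the CUDA driver stack is visible from Python at all (the guard keeps it runnable when torch is not installed):

```python
# Quick CUDA visibility check; torch is optional thanks to the guard.
try:
    import torch
    cuda_ok = torch.cuda.is_available()
    detail = torch.cuda.get_device_name(0) if cuda_ok else "no CUDA device"
except ImportError:
    cuda_ok = False
    detail = "torch not installed"
print("CUDA visible:", cuda_ok, "-", detail)
```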

If you check here you’ll see that my GPU is triggered when STT happens, so I presume my container does use the graphics card.

Are you sure about that? My assumption is that the more computing power I have, the more accurately speech is transcribed.
For example, ALL the Italian users on the cloud report good recognition quality. And in my view the tool should be the same (Whisper), so the difference in results can only be a consequence of the configuration.

Hi, I'm trying to set up Whisper with GPU on my computer with an RTX 3060.

How did you expose the gpu to the container?

Been trying literally every night for 1–2 hours and all I get is: “CUDA failed with error named symbol not found”. I don’t understand what I’m missing here.

Would love some help
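In case it helps while waiting for an answer: the compose file shared earlier wires the GPU in via `runtime: nvidia` plus a device reservation. The plain `docker run` equivalent is roughly the sketch below (it requires the NVIDIA Container Toolkit on the host; image, port, and variables are taken from that compose). Note that a “named symbol not found” error often points at a mismatch between the host driver and the CUDA/cuDNN libraries inside the image rather than at the GPU flag itself:

```shell
# Sketch: expose the GPU to the container via the NVIDIA Container Toolkit.
docker run -d --name wyoming-whisper \
  --gpus all \
  -p 10300:10300 \
  -e MODEL=turbo -e LANGUAGE=it -e COMPUTE_TYPE=float16 \
  slackr31337/wyoming-whisper-gpu:latest
```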

The installation method and the performance of the GPU are different things.

That’s a valid point, but your graphics card offers strong enough performance (processing short phrases in about one second) to deliver latency comparable to cloud-based solutions.