Help in finding a reliable STT for Italian

Hi all,

I’m struggling to find a reliable way to cut the cord from my Google Home Minis.

I’ve spent time and money testing several platforms, but in the end the experience is still poor.

I mainly use an HA Voice PE for my tests, so at least the mic should be good.

For STT I tested:

  1. Whisper = slow and poor recognition
  2. Vosk = better experience but still not enough

I then moved the pipeline from my Celeron NUC to my Xeon media server (a Dell T20) and tried the two STT engines above again, but recognition was still poor.

I then added an Nvidia GeForce GTX 1660 6GB and used whisper:gpu, but recognition time is still high and quality is poor.

Lately I found WhisperX and other containers, but my knowledge is limited, so I’m not able to build a container from scratch.

Is there some good samaritan who can support me in this journey? I don’t want to lose this fight.

Thanks!

How long?
The turbo model on the 1060 gives an average latency of 1 second.

Hmm, maybe it’s not using the GPU at all.

I checked the logs and found this

INFO:faster_whisper:Processing audio with duration 00:02.820
INFO:wyoming_faster_whisper.handler:!!!

check the device load in nvtop


seems the GPU is used…

Give me an example of the phrase used here. I will try to make measurements on my hardware.

I used a simple ACCENDI SCRIVANIA, which means “turn on scrivania” (a light with that name).
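To make those measurements comparable across setups, a tiny stdlib timer is enough. The `timed` helper below is generic; the lambda standing in for a real STT call is just an illustration, not an actual recognizer:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example with a dummy stand-in for a real STT call:
text, seconds = timed(lambda phrase: phrase.lower(), "ACCENDI SCRIVANIA")
print(f"{text!r} took {seconds:.3f}s")
```

Wrapping whatever transcription call you test in `timed(...)` gives latency numbers you can compare directly between machines.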

The second gpu is not involved in the recognition.

Right now I’m using the container from slackr31337. But the service from the original repository runs at an identical speed.

Can you share the details of your docker run? Just to see if there’s any difference

Pretty standard configuration

services:
  wyoming-whisper:
    image: slackr31337/wyoming-whisper-gpu:latest
    container_name: wyoming-whisper
    environment:
      - MODEL=turbo
      - LANGUAGE=ru
      - COMPUTE_TYPE=int8
      - BEAM_SIZE=5
    ports:
      - 10300:10300
    volumes:
      - /home/mchk/data:/data
    restart: unless-stopped
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities:
                - gpu
                - utility
                - compute

And also
NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8
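One detail in that compose worth experimenting with: `COMPUTE_TYPE=int8` trades some accuracy for speed, and quantization can cost recognition quality. A float16 variant may be worth comparing; this is a sketch assuming the image honors the same environment variables shown in the compose above, with the language forced to Italian for the original poster’s case:

```yaml
    environment:
      - MODEL=large-v3        # or turbo; larger models are usually more accurate
      - LANGUAGE=it           # force Italian instead of auto-detect
      - COMPUTE_TYPE=float16  # int8 quantization can reduce accuracy on GPU
      - BEAM_SIZE=5
```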

If nothing can be fixed, use cloud integration. Groq can be used for free up to a certain limit.

It’s still imprecise.

Regarding the cloud solution: well, I spent money on a GPU precisely to keep everything at home…

Have you installed pytorch?

I get great results with whisper-large-v3-turbo, but you may not have enough VRAM. The big Whisper versions are very good, and also a bit better with Italian than with English.

No, but I tried large-v3 and the turbo model, and performance is still worse than Vosk.

The installation method does not affect the quality of speech recognition. Whether you run a local service, Docker, or a cloud provider, Whisper (with the same model) will produce identical results.
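One way to test that claim directly is to call faster-whisper (the engine behind wyoming-faster-whisper) on a recorded clip, bypassing the container entirely. A minimal sketch, assuming faster-whisper is installed and a `clip.wav` with the test phrase exists; the import guard keeps it runnable without the library:

```python
try:
    from faster_whisper import WhisperModel  # pip install faster-whisper
except ImportError:
    WhisperModel = None  # library not installed; the sketch degrades gracefully

def transcribe_italian(path, model_name="large-v3"):
    """Transcribe one clip in Italian; returns None if faster-whisper is missing."""
    if WhisperModel is None:
        return None
    model = WhisperModel(model_name, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(path, language="it", beam_size=5)
    return " ".join(seg.text.strip() for seg in segments)

# usage: transcribe_italian("clip.wav")
```

If the text this returns matches what the container produces for the same clip, the problem is not the installation; if it is noticeably better, the container configuration is the suspect.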

Your problem can be divided into two components.
The increased processing time is probably related to the installation method; it is difficult to give advice there.

As for the recognition quality, you can check this by temporarily using the cloud integration with an identical model. This way you can find out whether there is a problem with Whisper in general, or only with the local installation…

I doubt you are running on the graphics card. The only way I know is to use PyTorch with CUDA.
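Worth noting: faster-whisper actually runs on CTranslate2, not PyTorch, so PyTorch is not strictly required. Still, a quick check like this tells you whether the CUDA driver stack is visible from Python at all (the guard keeps it runnable when torch is not installed):

```python
# Quick CUDA visibility check; torch is optional thanks to the guard.
try:
    import torch
    cuda_ok = torch.cuda.is_available()
    detail = torch.cuda.get_device_name(0) if cuda_ok else "no CUDA device"
except ImportError:
    cuda_ok = False
    detail = "torch not installed"
print("CUDA visible:", cuda_ok, "-", detail)
```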

If you check here you’ll see that my GPU is triggered when STT happens, so I presume my container does use the graphics card.

Are you sure about that? My assumption is that the more computing power I have, the more accurately speech is transcribed.
For example, ALL the Italian users on the cloud report good recognition quality. And in my view the tool should be the same (Whisper), so the difference in results can only be a consequence of the configuration.

Hi, I'm trying to set up Whisper with GPU on my computer with an RTX 3060.

How did you expose the gpu to the container?

Been trying literally every night for 1–2 hours and all I get is: “CUDA failed with error named symbol not found”. I don’t understand what I’m missing here.

Would love some help
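In case it helps while waiting for an answer: the compose file shared earlier wires the GPU in via `runtime: nvidia` plus a device reservation. The plain `docker run` equivalent is roughly the sketch below (it requires the NVIDIA Container Toolkit on the host; image, port, and variables are taken from that compose). Note that a “named symbol not found” error often points at a mismatch between the host driver and the CUDA/cuDNN libraries inside the image rather than at the GPU flag itself:

```shell
# Sketch: expose the GPU to the container via the NVIDIA Container Toolkit.
docker run -d --name wyoming-whisper \
  --gpus all \
  -p 10300:10300 \
  -e MODEL=turbo -e LANGUAGE=it -e COMPUTE_TYPE=float16 \
  slackr31337/wyoming-whisper-gpu:latest
```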

The installation method and the performance of the GPU are different things.

That’s a valid point, but your graphics card offers strong enough performance (processing short phrases in about one second) to deliver latency comparable to cloud-based solutions.