Whisper: self-hosted performance for French

Depends on the CPU you’re comparing it to but it will almost certainly be dramatically faster. If you look at our Willow Inference Server benchmarks (we use the same engine as faster-whisper) you will see that a GTX 1070 is roughly 5x faster than an AMD Threadripper PRO 5955WX, which is a ridiculous CPU. On the other end of the spectrum it is 119x faster (yes, really) than a Raspberry Pi 4.

The GTX 960 is a generation and model lower than a GTX 1070 but it’s probably at least 20x faster than whatever CPU you’re using.

HA runs on one machine and the GTX 970 is in another, so I have no idea how to proceed. Which is easier: moving HA to the machine with the graphics card, or installing a standalone Whisper on it? Can you tell me where I can find a video or tutorial for either approach?
@kristiankielhofner

I’m the founder of Willow. I have very limited experience with faster-whisper, Wyoming, and the other components of HA Voice. I’m here to help out generally where I can but unfortunately I don’t know how you would go about using a GPU with HA and those components.

I was trying to set up STT with Whisper on my Raspberry Pi 4, but it looks like it will be difficult to get something efficient for French?

The Raspberry Pi is challenging in terms of performance and accuracy even with English, so I would imagine French is much worse. In the evaluations for the reference Whisper implementation, French has 2x as many errors as English, and that is with the highest possible accuracy settings and the biggest model.

When you factor in the fundamental performance and accuracy issues and add noisy speech from most voice assistant hardware setups I doubt it will really work at all.

It works, by the way, but not as well as I want :)

That’s good to hear but there is “works” and works.

Something like turning a light on and off is pretty easy to transcribe accurately. It also seems like many in the HA Voice community are still at the testing/experimentation stage. If you look at YouTube videos, etc., they are purposefully and very consciously speaking very slowly and very clearly under ideal conditions (no background noise, etc.). In many cases they’re also less than 1m away from the microphone. That is very far from the real-world voice assistant use case, where people mumble from ~5m away with background noise, echo, etc.

Under those conditions even turning a light on/off is much harder, and functionality competitive with commercial solutions, which need much more complex grammar (asking to play a song, set a timer, make a calendar entry, add to a shopping list, etc.), is going to be well out of reach.

Had a look at Willow, nice project :) If I succeed in running an inference server on a good machine with the required hardware, does it integrate well with the HA voice assistant?

Thanks!

Yes it does. For all of the speech tasks it uses the exact same voice pipelines, intents, etc. that native HA Voice does.

The architecture, hardware, and the entire approach are completely different, but thanks to the openness of HA, its APIs, etc., we integrate with it very well.

You can run the rhasspy/wyoming-whisper Docker container on your GPU machine and use its IP (the host one) and port 10300 when adding the Wyoming device in HA.
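
Before adding the Wyoming device in HA, it can help to confirm that port 10300 on the GPU machine is reachable from another machine on the network. Here is a minimal Python sketch, with some assumptions: it only checks that a TCP connection opens and does not speak the Wyoming protocol, the function name `wyoming_port_open` is my own, and the IP address in the usage comment is a made-up placeholder.

```python
import socket


def wyoming_port_open(host: str, port: int = 10300, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only tests reachability (container running, port published,
    firewall open). It does not verify that a Wyoming service answers.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Example usage (hypothetical address, replace with your GPU machine's IP):
#   print(wyoming_port_open("192.168.1.50"))
```

If this returns False, the usual suspects are the container not publishing the port to the host or a firewall on the GPU machine.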

I run VirtualBox on the Windows machine, and HA runs inside VirtualBox. So I can’t have both VirtualBox and WSL running together, at least not in an efficient way that I know of.

Hi,

I just tried Whisper too (Wyoming-whisper) in French and it’s unusable for me as well. I tried multiple models, multiple ways of saying things, using aliases… I haven’t managed to trigger a single thing with it yet.

For example, “Eteindre table de nuit” (“turn off the nightstand”) becomes “Et temps rétables de lundit”. There are nearly always words that don’t exist.
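
To put a number on how far off that transcript is, here is a small sketch of word error rate (WER), the metric usually quoted for speech recognition: word-level edit distance divided by the number of words in the reference. The function name and the lowercase/whitespace normalisation are my own simplifications, not what Whisper's published evaluations use.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("Eteindre table de nuit", "Et temps rétables de lundit")
print(f"WER: {wer:.0%}")  # prints "WER: 100%"
```

Only “de” survives: three words are substituted and one is inserted, so four edits against a four-word reference.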

I am running on an Odroid N2+, container installation. Is there a way to make it a bit better, or is there no solution for now?

Have you found a way to use Whisper-JAX in Home Assistant? Its speed utilizing a Coral TPU looks promising.

Not yet. I’ve used an “old” RTX 2060 with the medium-int8 model and it works pretty well in French, with no more trouble from wrong words (“éteins” comes out as “éteins”, no more weird stuff). The whole pipeline takes about 1 second (with openWakeWord), so I haven’t dug into JAX for the moment…

Hello,

You can also try the Vosk add-on from synesthesiam; it is very fast and accurate in French (less than 2 seconds on an RPi 4).
hassio-addons/vosk at master · rhasspy/hassio-addons (github.com)

Hello @will35, thanks for your link, which gives some optimism to French users like me!

I’ll try it tonight. Do you have any specific recommendations for configuring the add-on?
The documentation is rich and it seems there are a lot of possible tweaks.

Hello @ndrkxd

Just the default installation: choose the fr language and allow unknown … that’s all, and it works very well.

Can confirm, super impressed by the performance compared to faster-whisper.
It is not perfect but it’s definitely usable on an RPi 4.

Thanks again!

Just wanted to thank you and confirm that Vosk is very accurate and very fast in French.

Just tried Vosk too in French. It’s awesome. It takes 0.05s for STT, while Whisper takes 3.6s to 3.8s. It’s absolutely insane!
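
For a sense of scale, the timings quoted above can be turned into a speedup factor. This is just arithmetic on the numbers from that one post (0.05s for Vosk, 3.6s to 3.8s for Whisper), not a benchmark, and the helper name `speedup` is my own.

```python
def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the fast run is compared to the slow one."""
    return slow_seconds / fast_seconds


vosk_seconds = 0.05
whisper_low, whisper_high = 3.6, 3.8

low = speedup(whisper_low, vosk_seconds)
high = speedup(whisper_high, vosk_seconds)
print(f"Vosk is roughly {low:.0f}x to {high:.0f}x faster")
# prints "Vosk is roughly 72x to 76x faster"
```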