Best combination for Voice Assistant performance

Hi all,

What is your experience with getting the fastest reply to any voice command, in terms of combinations of conversation agent, STT, and TTS?

I’m thinking of replies to questions, and of performing actions like turning off a light.

I’m still not even close to getting the same response time for turning off a light as I get with Google Home.

Let me hear your experiences.

Hi

Well, my tests using the integrated Assist have been quite nice; it’s quite fast to answer and processes everything locally (I use Vosk rather than Whisper, as I use it in French :wink:).
Tested with a ReSpeaker 2 Lite board from Seeed Studio!

Vincèn

What do you use for TTS?

Piper, even on CPU, is blazing fast for TTS (only tested in English). I use Piper via the Wyoming protocol.
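For anyone wanting to try this, a minimal sketch of running Piper as a Wyoming service with Docker (the `rhasspy/wyoming-piper` image and the `en_US-lessac-medium` voice are examples; pick a voice for your language and check the image docs for current flags):

```shell
# Run Piper as a Wyoming TTS service on port 10200 (CPU only).
# Voices are downloaded into the mounted /data volume on first use.
docker run -d --name wyoming-piper \
  -p 10200:10200 \
  -v piper-data:/data \
  rhasspy/wyoming-piper \
  --voice en_US-lessac-medium
```

Then in Home Assistant you add it through the Wyoming Protocol integration, pointing at the host’s IP and port 10200, and select it as the TTS engine in your Assist pipeline.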

I use Whisper (CUDA-accelerated) for STT and it is fast. It’s on par with Google and Alexa for response times (and is totally local). I am also using a ReSpeaker Lite XMOS board with a XIAO ESP32-S3.

Very interested to learn about the CUDA-accelerated way; I never knew about it. Could you spare a couple of minutes and explain what you’ve done?

I have a custom setup from before it was easy to do, but now I think it’s really easy. When I get home I’ll find the repo and instructions.

It’s a Docker-based solution using nvidia-container-toolkit to pass the GPU through to the containers. It has options for different STT and TTS backends; some have CUDA acceleration, some don’t.

It has wake word, STT, and TTS containers available. However, I would recommend an ESP32-S3-based assistant, which allows wake-word detection on the device itself rather than streaming audio to the wake-word host over the network 24/7.
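As a rough sketch of the GPU pass-through side (assuming the NVIDIA driver and nvidia-container-toolkit are already installed, and using the `rhasspy/wyoming-whisper` image as a stand-in for whatever CUDA container the repo ships; the exact flags depend on the image, so check its docs):

```shell
# First verify that the toolkit can expose the GPU inside a container.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# Run a Wyoming Whisper STT service with the GPU passed through.
# --device cuda tells faster-whisper to run inference on the GPU.
docker run -d --name wyoming-whisper \
  --gpus all \
  -p 10300:10300 \
  -v whisper-data:/data \
  rhasspy/wyoming-whisper \
  --model small-int8 --language en --device cuda
```

The first command should print the same GPU table you see on the host; if it errors out, the container runtime isn’t seeing the GPU and the STT container won’t either.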

Switch to the GPU branch and build from it.

Cool! Are you running Whisper and Piper as containers on an external device or on the same physical hardware as HA?

Same physical device, compartmentalized into VMs and LXC containers. So a different Linux host via virtualization, but the same physical host.

What type of NVIDIA GPU are you using?

For Whisper I’m using a GTX 1660 Ti; it has more than enough compute and VRAM to handle that workload.

Do you run Piper on the same GPU ?

Piper, no (CPU only, and it’s still around 0.1 s per response, so there’s not much to gain from acceleration). Wake word, no, as the ESP32-S3 does wake-word detection on device.

My GPU is also doing RTSP H.264 CUDA decoding for my ZoneMinder install, and it runs my custom ML framework for ZoneMinder, doing object detection with YOLO-NAS, YOLOv10, and YOLOv8. Which goes to show that cheap, older hardware can get you all sorts of ML/DL goodness.