Well, my tests with the integrated Assist have been quite nice: it answers quickly and processes everything locally (I used Vosk rather than Whisper, since I use it in French).
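For reference, here's a minimal sketch of what French recognition with Vosk looks like in Python. The model and file names are just examples, and Assist actually talks to Vosk through an STT server rather than calling this API directly:

```python
# Minimal Vosk sketch (not how Assist wires it up; Assist talks to an STT
# server). Model directory and WAV file names are placeholders.
import json
import wave

from vosk import KaldiRecognizer, Model

model = Model("vosk-model-small-fr-0.22")  # a French model from alphacephei.com
with wave.open("commande.wav", "rb") as wav:  # expects 16-bit mono PCM
    rec = KaldiRecognizer(model, wav.getframerate())
    while True:
        data = wav.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)
    print(json.loads(rec.FinalResult())["text"])
```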
Tested with the ReSpeaker 2 Lite board from Seeed Studio!
Piper, even on CPU, is blazing fast for TTS (only tested English). I use Piper via the Wyoming protocol.
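To give an idea of the Wyoming side, here's a rough sketch of requesting speech from a Piper server with the `wyoming` Python client library. The host, port, and text are assumptions (10200 is the usual wyoming-piper default), and Home Assistant normally does all of this for you:

```python
# Rough sketch of a Wyoming TTS request to a Piper server.
# Host/port are assumptions; Home Assistant normally handles this.
import asyncio

from wyoming.audio import AudioChunk, AudioStop
from wyoming.client import AsyncTcpClient
from wyoming.tts import Synthesize

async def speak(text: str) -> bytes:
    audio = b""
    async with AsyncTcpClient("localhost", 10200) as client:
        await client.write_event(Synthesize(text=text).event())
        while True:
            event = await client.read_event()
            if event is None or AudioStop.is_type(event.type):
                break
            if AudioChunk.is_type(event.type):
                audio += AudioChunk.from_event(event).audio
    return audio

raw_pcm = asyncio.run(speak("Hello from Piper"))
print(f"received {len(raw_pcm)} bytes of PCM audio")
```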
I use Whisper (CUDA-accelerated) for STT and it is fast: on par with Google and Alexa for response times, and totally local. I am also using the ReSpeaker Lite XMOS board with a XIAO ESP32-S3.
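For anyone curious, one common way to run Whisper with CUDA is faster-whisper (the engine behind wyoming-faster-whisper). A short sketch, where the model size and audio file are just examples, not my exact setup:

```python
# Sketch of CUDA-accelerated Whisper STT via faster-whisper.
# Model size and audio file name are examples.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = model.transcribe("command.wav", language="en")
print(f"detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```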
I have a custom setup from before this was easy to do, but now I think it's really straightforward. When I get home I'll find the repo and instructions.
It's a Docker-based solution that uses the NVIDIA Container Toolkit to pass the GPU through to the containers. It has options for different STT and TTS backends; some have CUDA acceleration, some don't.
It has wake word, STT, and TTS containers available. However, I would recommend an ESP32-S3-based assistant, which runs wake word detection on the device, rather than streaming audio 24/7 over the network to a wake word host (see the sketch below for what that host-side path involves).
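For contrast, this is roughly what host-side wake word scoring involves, sketched with openWakeWord (one of the engines available as a Wyoming container). The threshold is a tunable assumption, and the zero frame stands in for audio a satellite would have to stream continuously, which is exactly what on-device detection avoids:

```python
# Sketch of host-side wake word scoring with openWakeWord; the audio frame
# stands in for data a satellite would have to stream over the network 24/7.
import numpy as np
from openwakeword.model import Model
from openwakeword.utils import download_models

download_models()  # fetch the bundled pretrained models on first run
model = Model()    # loads the pretrained wake word models

# openWakeWord scores 80 ms frames: 1280 samples of 16 kHz 16-bit mono audio
frame = np.zeros(1280, dtype=np.int16)  # placeholder for a streamed frame
scores = model.predict(frame)
for name, score in scores.items():
    if score > 0.5:  # threshold is a tunable assumption
        print(f"wake word detected: {name}")
```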
Piper, no (CPU only, and it's still around 0.1 s of processing time, so there's not much to gain from acceleration); wake word, no, since the ESP32-S3 does wake word detection on device.
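If you want to check that ~0.1 s figure yourself, here's a rough timing sketch, assuming the piper-tts Python package and a voice model you've already downloaded (the file name is an example, not a prescription):

```python
# Rough timing sketch for CPU Piper synthesis, assuming the piper-tts
# package; the voice model file name is an example.
import time
import wave

from piper import PiperVoice

voice = PiperVoice.load("en_US-lessac-medium.onnx")  # voice downloaded beforehand
start = time.perf_counter()
with wave.open("out.wav", "wb") as wav_file:
    voice.synthesize("Local text to speech is quick on CPU.", wav_file)
print(f"synthesis took {time.perf_counter() - start:.3f}s")
```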
My GPU also does CUDA-accelerated H.264 decoding of RTSP streams for my ZoneMinder install, and runs my custom ML framework for ZoneMinder that does object detection with YOLO-NAS, YOLOv10, and YOLOv8. It goes to show that cheap, older hardware can get you all sorts of ML/DL goodness.
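My framework is custom, but as an illustration of how little code GPU object detection takes these days, here's a minimal Ultralytics YOLOv8 sketch (model and image names are examples):

```python
# Minimal GPU object-detection sketch with Ultralytics YOLOv8; only
# illustrates the general technique, not my custom ZoneMinder framework.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # small pretrained model, downloaded on first use
results = model("camera_frame.jpg", device=0)  # device=0 selects the first CUDA GPU
for result in results:
    for box in result.boxes:
        label = result.names[int(box.cls)]
        print(f"{label}: {float(box.conf):.2f}")
```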