Voice PE much slower than S3 Box 3

So, my Voice PE arrived today, and after some testing, I’m rather disappointed.
Literally side by side, it’s much, much slower than the S3 Box 3.

I have some automations based on custom sentences, and whilst both devices are reasonably fast to action the automation, I also have a custom text-to-speech response set, and the Voice PE is several seconds slower to issue the TTS response.

Also, both seem to be about as good as each other at detecting the wake word. With me across the room and background music playing, they seem comparable. If I put background speech on, then neither picks up the wake word when it is spoken.

Also, the accuracy of the speech detection is about as bad as on the S3. I’ve had to bump the local Whisper model to medium.en to get reasonable speech recognition, but that adds a few seconds of delay.

Both are set to the default for ‘finish speaking detection’.

And the last thing I’m not sure of is why the TTS response is so slow (on both, but much more so on the Voice PE). The logs show the spoken command is recognised in around 2-3 seconds, but it can take 10+ seconds to get the TTS response.
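To pin down where those extra seconds go, one option is to copy the event timestamps out of the pipeline debug view and work out how long each stage took. The following is only a rough sketch: it assumes events are named in a `<stage>-start` / `<stage>-end` pattern (for example `stt-start`, `tts-end`), and the sample timestamps are made up purely for illustration, not taken from my logs.

```python
from datetime import datetime

# Hypothetical event list: (event type, ISO 8601 timestamp).
# The names assume a "<stage>-start" / "<stage>-end" pattern; the timestamps
# are invented for illustration and are not real measurements.
SAMPLE_EVENTS = [
    ("stt-start", "2025-01-01T12:00:00.000+00:00"),
    ("stt-end", "2025-01-01T12:00:02.500+00:00"),
    ("intent-start", "2025-01-01T12:00:02.500+00:00"),
    ("intent-end", "2025-01-01T12:00:02.700+00:00"),
    ("tts-start", "2025-01-01T12:00:02.700+00:00"),
    ("tts-end", "2025-01-01T12:00:03.900+00:00"),
]


def stage_durations(events):
    """Return {stage: seconds} for each stage that has both a start and an end event."""
    starts = {}
    durations = {}
    for event_type, stamp in events:
        # Split on the last hyphen so "stt-vad-start" becomes ("stt-vad", "start").
        stage, _, phase = event_type.rpartition("-")
        when = datetime.fromisoformat(stamp)
        if phase == "start":
            starts[stage] = when
        elif phase == "end" and stage in starts:
            durations[stage] = (when - starts[stage]).total_seconds()
    return durations


if __name__ == "__main__":
    for stage, seconds in stage_durations(SAMPLE_EVENTS).items():
        print(f"{stage:8s} {seconds:6.2f} s")
```

If the per-stage times add up to far less than the delay heard in the room, the remainder is being spent somewhere these events do not cover, such as the device fetching and playing the audio.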

Could you give us a bit more information about how you are handling speech recognition and TTS? And how is the S3 configured? I believe there are several firmware options you can flash it with.

So, a bit more testing, and it seems the main bottleneck is faster-whisper.
Despite the logs saying it detects speech in 2-3 seconds, if I turn the model size down from large to tiny, I get the whole recognition and reply done in 2 seconds.

I’m running it on an old Dell Micro 5070 with an i7-8700T CPU.

The problem is that tiny is pretty bad at voice recognition. ‘Movie’ is often misheard as ‘moving’, and there are many other errors like it.
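To put numbers on the speed side of that trade-off outside Home Assistant, a small script along these lines could time each model size against the same recording. It is a sketch under assumptions: it needs the `faster-whisper` Python package installed, the recording path is passed on the command line (any short WAV of a test command), and `device="cpu"`, `compute_type="int8"` and `beam_size=1` are guesses that may not match what the add-on actually uses. The helper names are hypothetical.

```python
import statistics
import sys
import time


def median_seconds(fn, runs=3):
    """Call fn() `runs` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def make_transcriber(model_size, audio_path):
    """Return a zero-argument callable that transcribes audio_path with one model."""
    # Imported lazily so the timing helper works without faster-whisper installed.
    from faster_whisper import WhisperModel

    # Assumed settings; the add-on's own configuration may differ.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")

    def run():
        segments, _info = model.transcribe(audio_path, beam_size=1)
        # `segments` is a generator: the decoding happens while iterating over it.
        return " ".join(segment.text.strip() for segment in segments)

    return run


def compare_models(sizes, audio_path, runs=3, factory=make_transcriber):
    """Return {model size: median seconds} for one recording.

    Model loading happens inside factory(), before the timer starts,
    so only transcription time is measured.
    """
    return {size: median_seconds(factory(size, audio_path), runs) for size in sizes}


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: python bench.py <recording.wav>")
    else:
        for size, seconds in compare_models(["tiny", "medium.en"], sys.argv[1]).items():
            print(f"{size:10s} {seconds:6.2f} s")
```

Running it on the same machine as faster-whisper gives a baseline for the model alone, which makes it easier to tell whether any further delay comes from the model or from the device.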

Still not sure why the VPE is so much slower…

After further testing, regardless of which faster-whisper model is used, the S3 Box 3 is always noticeably faster than the Voice PE, and they are both about the same at recognizing the wake word.

The larger the Whisper model, the bigger the Box 3’s lead over the VPE. Not sure why that is.

Interestingly, if I change the speech-to-text engine to Google, both devices respond in about the same time (almost instantly).