I would like to know if there are any plans in the works to offload speech recognition and speech generation to an Edge TPU such as the Coral. If so, I would like to know the rough timeline. This would speed up recognition and speech generation while lowering CPU demand.
This is probably outside the scope of what is possible for the developers.
If you look at Rhasspy, you will see that all of those features are provided by third-party engines such as raven, deepspeech, larynx and so on.
It is these engines that have to be built with the feature for HA to be able to use them.
Most of these target graphics cards instead, because even a cheap, passively cooled graphics card will easily beat a Coral device.
Read this post for the computational power of graphics cards, and then the post two below it for the computational power of the Coral device.
The Coral supports a maximum TensorFlow Lite model size of 6MB. This works well for things like object detection/classification across something like 100 labels with Frigate, for example.
Detecting all of human speech is a completely different task. The smallest (not very good) speech recognition model is roughly 75MB, which is completely impossible for the Coral to use. The best one is roughly 1GB. A used Nvidia GPU costing about $100 has 8GB of VRAM and can load all of our models simultaneously, using about 50% of the available VRAM while idling at 8 watts.
Because of these memory limitations, the Coral fundamentally cannot be used for general-purpose speech recognition. You can see from their examples that they are able to detect maybe 140 speech phrases, while the on-device command recognition of the ESP BOX supports 400.
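To make the size gap concrete, here is a rough sketch of the arithmetic using only the figures quoted above. The model labels (`smallest_stt`, `best_stt`) and the helper names are made up for illustration, and "roughly 1GB" is treated as 1024MB, which is an assumption.

```python
# Sizes in megabytes, taken from the rough figures quoted in this post.
CORAL_MODEL_LIMIT_MB = 6        # maximum TensorFlow Lite model size on the Coral
GPU_VRAM_MB = 8 * 1024          # used Nvidia GPU with 8GB of VRAM

# Hypothetical labels for the two speech recognition models mentioned above.
# "Roughly 1GB" is assumed to mean 1024MB here.
MODELS_MB = {
    "smallest_stt": 75,
    "best_stt": 1024,
}


def fits(model_mb: float, budget_mb: float) -> bool:
    """Return True if a model of the given size fits in the memory budget."""
    return model_mb <= budget_mb


def report(models: dict, budget_mb: float) -> dict:
    """Map each model name to whether it fits in the given budget."""
    return {name: fits(size, budget_mb) for name, size in models.items()}


def oversize_factor(model_mb: float, budget_mb: float) -> float:
    """How many times larger the model is than the budget."""
    return model_mb / budget_mb


if __name__ == "__main__":
    print("Coral:", report(MODELS_MB, CORAL_MODEL_LIMIT_MB))
    print("GPU:  ", report(MODELS_MB, GPU_VRAM_MB))
    factor = oversize_factor(MODELS_MB["smallest_stt"], CORAL_MODEL_LIMIT_MB)
    print(f"Smallest model is {factor:.1f}x the Coral limit")
```

Even the smallest model comes out at more than ten times the Coral's limit, while both models fit in the GPU's VRAM with plenty of room to spare.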
Just to add some doubt here: Android phones have the option of running offline speech-to-text models (see Voice Recognition - Tech for Learning - Library Guides at University of Plymouth). I know Google has a resource advantage over Nabu or open source projects, but as I understand it, it should be technologically possible to run human speech recognition very fast on embedded devices, even without dedicated TPUs.