I would like to run a fully local voice assistant at Alexa/HA Cloud latency (0.2-0.4 s).
I am running HAOS on an E5-2620; I gave it 8 cores and 16 GB of memory.
Even with the largest Whisper model there is no swapping; the system hums
along with 512 MB of free memory.
With VA Preview 090254, NLP and TTS both execute
reasonably well (under 0.1 s), but the performance of faster-whisper
is poor:
per request:
- auto/en/beam0: 2-4 s
- distil-small/en/beam0: 6-8 s
- turbo/en/beam0: 30 s
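To make numbers like these comparable across backends, it helps to time each request the same way. Below is a small, generic harness I use for that kind of measurement; `transcribe` here is a placeholder callable standing in for whatever STT call you are testing (faster-whisper, a whisper.cpp subprocess, a Wyoming client, etc.), not any specific library API:

```python
import statistics
import time

def time_requests(transcribe, audio_paths, runs=3):
    """Call `transcribe(path)` several times per clip and return the
    median wall-clock seconds for each clip."""
    results = {}
    for path in audio_paths:
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            transcribe(path)  # swap in your real STT backend here
            samples.append(time.perf_counter() - start)
        results[path] = statistics.median(samples)
    return results

# Dummy backend so the harness itself is runnable as-is:
latencies = time_requests(lambda path: time.sleep(0.01), ["clip1.wav", "clip2.wav"])
for path, seconds in latencies.items():
    print(f"{path}: {seconds:.2f}s")
```

Using the median over a few runs smooths out first-request model-loading time, which otherwise dominates short clips.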
How do I go about improving my STT latency? Hardware? Config?
Should I consider moving the ESP/STT to an Apple Silicon M4 / UTM / HAOS setup? What about Windows 10 / Hyper-V / HAOS + RTX 2080?
How does HA Cloud run STT? Is it a standard open-source HA
instance on a big box full of GPUs, or some custom Whisper setup?
Is that something one can replicate in a homelab?
Any GPU will give you good speed; even a GTX 1060 can run the turbo model with a delay of about a second.
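For reference, wyoming-faster-whisper exposes the relevant knobs as command-line flags. The flag names below are taken from the wyoming-faster-whisper README, so verify them against your installed version; also note the stock Docker image is typically CPU-only, so for CUDA you may need to run from source:

```shell
# Launch wyoming-faster-whisper on a CUDA GPU with a small beam
# (flag names assumed from the project README; double-check with --help)
python -m wyoming_faster_whisper \
  --uri tcp://0.0.0.0:10300 \
  --model turbo \
  --language en \
  --beam-size 1 \
  --device cuda \
  --compute-type float16 \
  --data-dir ./data \
  --download-dir ./data
```

On CPU-only boxes, a smaller model plus `--compute-type int8` and `--beam-size 1` is usually the biggest latency win.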
For Apple Silicon you can use the whisper.cpp project.
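Getting whisper.cpp running with Metal on a Mac is straightforward; Metal acceleration is enabled by default on macOS. The commands below follow the current upstream README (recent versions build with CMake and install the binary as `whisper-cli`; older releases used `make` and `./main`), so re-check against the version you clone:

```shell
# Build whisper.cpp on Apple Silicon (Metal on by default for macOS)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build -j

# Fetch a model and transcribe the bundled sample
./models/download-ggml-model.sh medium.en
./build/bin/whisper-cli -m models/ggml-medium.en.bin -f samples/jfk.wav
```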
An RTX 2080 sounds like a good option, but not with Windows 10 and not with Hyper-V.
Windows desktop OSes should not be used for 24/7 servers, and especially not as hypervisors.
Hyper-V is tricky to set up correctly for hardware pass-through, if it is possible at all.
I got wyoming-faster-whisper to work on the Apple M4 as a Wyoming endpoint. turbo and medium.en both now run at about 5-8 s. I assume this is because faster-whisper is not using MPS or MLX.
The current Rhasspy GitHub master bundles whisper.cpp 1.5.4 (upstream is at 1.7.4); even so, the bundled version uses Metal on an Apple M4 for acceleration: medium (0.5 s) and large-v3 (1.2 s). I may try to build a Wyoming endpoint on Windows + RTX 2080 for a performance comparison if anyone is interested.
Also, has anyone tried building a Wyoming bridge using the native macOS speech recognition?
Now I am battling noise and end-of-speech detection like everyone else.
For example, when the Roomba is running, or when I use a loud clicky keyboard, the VA is not usable
(it interprets the clicking keys as a dog barking).
Unlike Alexa/Google, where you can speak the wake word and the command in one fluid sentence, I have to wait for the wake sound before speaking the command to get clean recognition. Is there a way around that? Can the 090254 operate by always recording?
I was able to install this MLX version of whisper.cpp with Wyoming protocol support on an M1 Mac mini with 16 GB RAM, and the response latency is much improved over Raspberry Pi 5 HAOS / faster-whisper.