It would be great to rank transcription/agent/synthesis engines as opposed to choosing one for each pipeline step. The goal is to minimize latency without sacrificing capability.
Steps
Transcription
I would like to configure the pipeline to prioritize the models as such:
1. Speech-to-Phrase on the N150 miniserver where Home Assistant runs.
2. Nabu Cloud.
3. faster-whisper on my beefy workstation.
4. faster-whisper on the miniserver from step 1.
This way:
- If the query is simple, Speech-to-Phrase would handle it.
- If it’s not but I have an internet connection, I’d get a fast response from the cloud provider.
- If the internet is down, I’d use my workstation and its AVX-512 CPU.
- If my workstation is turned off, the miniserver itself (which on paper supports AVX2, but I can’t get it to work for some reason) will handle it. Take that, Alexa.
A user who, unlike me, values privacy more than latency might want to rank the engines differently.
Nice-to-have ideas to minimize latency further, subject to heavily diminishing returns:
- The audio stream needs to fan out to multiple providers, which would run concurrently. The first response in ranking order that didn’t error out should be propagated to the next step in the pipeline (see the sketch after this list). The ranking is necessary largely to bias the input toward the names of the local devices via Speech-to-Phrase, but also because my local models might be smaller than the cloud ones.
- It would be cool to be able to express that steps 1 and 4 cannot run concurrently, so that the execution of 4 doesn’t starve the miniserver for resources while it’s working on step 1. If I change my mind and decide that I don’t want the workstation to do unnecessary work, I might want to order 3 after 2. Deferring providers like this means recording the audio to a memory buffer is also necessary.
- What if my workstation is currently running at 100% utilization because it’s rebuilding Chrome? Giving 3 and 4 the same ranking (i.e. P1 > P2 > P3 = P4) and picking whichever of the two responds first without an error might help.
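Here is a rough asyncio sketch of that fan-out, under the same assumptions as the earlier snippet; `first_success_ranked` is a made-up name, and equal rank values express the P3 = P4 racing idea:

```python
import asyncio
from collections import defaultdict
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def first_success_ranked(
    ranked: list[tuple[int, Callable[[], Awaitable[T]]]],  # (rank, call); lower rank wins
) -> T:
    """Fan out to all providers at once, then return the best-ranked
    non-error response. Calls that share a rank are raced against each
    other, and whichever finishes first wins that tier."""
    tiers: dict[int, list[asyncio.Future[T]]] = defaultdict(list)
    for rank, call in ranked:
        tiers[rank].append(asyncio.ensure_future(call()))
    try:
        for rank in sorted(tiers):  # walk the tiers in ranking order
            for next_result in asyncio.as_completed(tiers[rank]):
                try:
                    return await next_result  # first non-error result in this tier
                except Exception:
                    continue  # this provider failed; try the next result
        raise RuntimeError("all providers failed")
    finally:
        for futures in tiers.values():
            for fut in futures:
                fut.cancel()  # stop anything still running
```

The mutual-exclusion constraint between steps 1 and 4 isn’t shown; one way to layer it on would be to defer starting a tier’s tasks until the earlier tier has failed, which is exactly why the audio has to be buffered in memory.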
Conversation Agent
For LLMs I would like the pipeline to go through:
- The Home Assistant engine.
- The (Extended) OpenAI Conversation integration.
- The local DeepSeek 7B running on the workstation.
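If something like the `first_success` helper sketched earlier existed, this tier would be the same pattern with different placeholders:

```python
# Hypothetical reuse of first_success for the agent ranking (all names made up):
reply = await first_success([
    lambda: home_assistant_agent.process(text),
    lambda: openai_conversation.process(text),
    lambda: deepseek_workstation.process(text),
])
```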
For the user who prioritizes local models over the larger cloud models, it might also be beneficial if the LLM could signal that the transcription was likely erroneous and request a redo of the transcription step of the pipeline with a slower, more capable model. The design space is huge here, so this is not part of this proposal.
Voice synthesis
Even simpler than transcription:
- Nabu Cloud.
- Piper on the workstation.
- ~~Microsoft Sam~~ Piper on the miniserver.
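And, under the same assumptions, the synthesis tier:

```python
# Hypothetical reuse of first_success for synthesis (names are placeholders):
audio_out = await first_success([
    lambda: nabu_cloud_tts.synthesize(reply),
    lambda: piper_workstation.synthesize(reply),
    lambda: piper_miniserver.synthesize(reply),
])
```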