Assist fallbacks

It would be great to rank transcription/agent/synthesis engines as opposed to choosing one for each pipeline step. The goal is to minimize latency without sacrificing capability.

Steps

Transcription

I would like to configure the pipeline to prioritize the models as follows:

  1. Speech-to-Phrase on the N150 miniserver where Home Assistant runs.
  2. Nabu Cloud
  3. faster-whisper on my beefy workstation
  4. faster-whisper on the miniserver from step 1.

This way:

  • If the query is simple, Speech-to-Phrase would handle it.
  • If it’s not but I have an internet connection, I’d get a fast response from the cloud provider.
  • If the internet is down, I’d use my workstation and its AVX-512 CPU.
  • If my workstation is turned off, the miniserver itself (which on paper supports AVX2, though I can’t get that to work for some reason) will handle it. Take that, Alexa.

A user who, unlike me, values privacy more than latency might want to rank the engines differently.
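To make this concrete, here is a minimal sketch of the ranking-with-fallback behaviour, assuming hypothetical engine objects with an async `transcribe` method (this is not the actual Assist pipeline API, and every name below is made up):

```python
async def fallback_chain(call, providers):
    # Try each provider in rank order; any failure (engine down, timeout,
    # host offline) falls through to the next one in the ranking.
    errors = []
    for provider in providers:
        try:
            return await call(provider)
        except Exception as err:
            errors.append(err)
    raise RuntimeError(f"all providers failed: {errors}")

# My transcription ranking from above, with made-up engine names:
# text = await fallback_chain(
#     lambda engine: engine.transcribe(audio),
#     [speech_to_phrase, nabu_cloud, workstation_whisper, miniserver_whisper],
# )
```

The privacy-first user would simply reorder the list.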

Nice-to-have ideas for shaving off more latency, with heavily diminishing returns:

  1. The audio stream would fan out to multiple providers running concurrently. The highest-ranked response that didn’t error out would be propagated to the next step in the pipeline (see the sketch after this list). The ranking is largely there to bias the input toward the names of my local devices via Speech-to-Phrase, but also because my local models might be smaller than the cloud one.
  2. It would be cool to be able to express that steps 1 and 4 cannot run concurrently, so that 4 doesn’t starve the miniserver of resources while it’s working on step 1. And if I change my mind and decide that I don’t want the workstation doing unnecessary work, I might want to run 3 only after 2 has failed. So recording the audio to a memory buffer is also necessary.
  3. What if my workstation is running at 100% utilization because it’s rebuilding Chrome? Giving 3 and 4 the same rank (i.e. P1 > P2 > P3 = P4) and picking whichever of 3 and 4 responds first would help.
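Here is one way all three ideas could fit together, again as a sketch with made-up engine objects: rank groups capture the P3 = P4 tie, and a shared lock keeps the two miniserver engines from ever running at the same time.

```python
import asyncio

async def first_success(tasks: set) -> str | None:
    # Within a rank group, the first task to finish without an exception
    # wins; returns None if every task in the group fails.
    pending = set(tasks)
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
        return None
    finally:
        for task in pending:
            task.cancel()

async def ranked_fanout(audio: bytes, rank_groups: list) -> str:
    # Fan the buffered audio out to every engine up front, then honour the
    # ranking when collecting: a lower group's result is used only if every
    # engine in the groups above it failed.
    groups = [
        [asyncio.create_task(engine.transcribe(audio)) for engine in group]
        for group in rank_groups
    ]
    try:
        for tasks in groups:
            result = await first_success(set(tasks))
            if result is not None:
                return result
        raise RuntimeError("all transcription engines failed")
    finally:
        for tasks in groups:
            for task in tasks:
                task.cancel()

class Serialized:
    # Wraps an engine so that engines sharing a lock never run concurrently,
    # e.g. Speech-to-Phrase and faster-whisper on the same miniserver.
    def __init__(self, engine, lock: asyncio.Lock):
        self._engine, self._lock = engine, lock

    async def transcribe(self, audio: bytes) -> str:
        async with self._lock:
            return await self._engine.transcribe(audio)

# Hypothetical wiring for my ranking (P1 > P2 > P3 = P4):
# miniserver_lock = asyncio.Lock()
# text = await ranked_fanout(audio, [
#     [Serialized(speech_to_phrase, miniserver_lock)],    # 1
#     [nabu_cloud],                                       # 2
#     [workstation_whisper,                               # 3 and 4 tied
#      Serialized(miniserver_whisper, miniserver_lock)],
# ])
```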

Conversation Agent

For LLMs, I would like the pipeline to go through:

  1. The built-in Home Assistant conversation engine.
  2. (Extended) OpenAI Conversation integration.
  3. The local DeepSeek 7B running on the workstation.
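The hypothetical fallback_chain from the transcription sketch would cover this step unchanged; only the call shape differs (the agent objects and their `respond` method are again made up):

```python
async def converse(prompt: str) -> str:
    # Same ranking-with-fallback idea, applied to conversation agents.
    return await fallback_chain(
        lambda agent: agent.respond(prompt),
        [home_assistant_agent, openai_conversation, deepseek_7b],
    )
```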

For a user who prioritizes local models over larger cloud ones, it might also be beneficial if the LLM could signal that the transcription was likely erroneous and request a redo of the transcription step with a slower, more capable model. The design space is huge here, so that is not part of this proposal.

Voice synthesis

Even simpler than transcription:

  1. Nabu Cloud.
  2. Piper on the workstation.
  3. Piper on the miniserver (even if it ends up sounding like Microsoft Sam).

I’m giving my +1 for this, although personally I’d already be satisfied with the option to set a single fallback for speech-to-text that kicks in automatically when the default fails. That way I could use Speech-to-Phrase as the default for speed, yet still fall back to faster-whisper for the cases where it doesn’t suffice.