There are a few things going on here.
First, the STT (Whisper in this case).
This takes the audio apart and returns a best approximation of a transcription into text. It's pretty lightweight and can easily be run locally. This step is skipped entirely (as is TTS at the end) if you are typing in a chat window like in the mobile client.
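For reference, here's a minimal sketch of what that transcription step does, using the openai-whisper Python package (HA's Whisper add-on actually runs faster-whisper behind the Wyoming protocol; the model size and audio filename below are placeholders):

```python
# Minimal sketch of the speech-to-text step with the openai-whisper package.
# HA's add-on uses faster-whisper via Wyoming instead; "base" and the audio
# filename are placeholders.
import whisper

model = whisper.load_model("base")        # small model, runs fine on CPU
result = model.transcribe("command.wav")  # audio in, text out
print(result["text"])                     # e.g. "turn on the kitchen lights"
```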
The text is passed to whatever you're using for recognition. This is where things can fork. You are either doing this 'locally' in HA terms, which just means handled by the built-in HA intent recognition, OR not locally - by the LLM.
For local recognition, the sentence is passed to the HA recognizer, and if it matches an intent it gets funneled off to that intent. Done. HA does its thing and returns a response (see TTS later…). Note that at this stage the HA parser is formulaic: if you don't phrase things the way it expects, or you didn't catch an idiosyncratic form, it won't match… It doesn't figure stuff out.
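To make 'formulaic' concrete, here's a toy Python sketch of template-style matching, which is the spirit of what the built-in recognizer does (HA actually uses hassil sentence templates; the patterns and intent names below are made up for illustration):

```python
# Toy illustration of formulaic matching: the sentence must fit a known
# template or nothing happens. HA's real matcher (hassil) is far richer;
# patterns and intent names here are invented for illustration.
import re

TEMPLATES = [
    (re.compile(r"^turn (?P<state>on|off) (the )?(?P<name>.+)$", re.I), "HassTurnOnOff"),
    (re.compile(r"^what is the temperature in (the )?(?P<area>.+)$", re.I), "GetTemperature"),
]

def match_intent(sentence: str):
    for pattern, intent in TEMPLATES:
        m = pattern.match(sentence.strip())
        if m:
            return intent, m.groupdict()
    return None, {}  # no match means no action (or a fallback to the LLM)

print(match_intent("turn on the kitchen lights"))    # matches
print(match_intent("make the kitchen less gloomy"))  # no match: too free-form
```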
You can also set up failover, or fallback: if the HA recognizer doesn't match an intent, the request falls back to whatever LLM is configured for that pipeline. It's chucked at the LLM.
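In code terms the fallback is roughly this shape (a sketch only; in HA it's a pipeline option, not something you write, and every function here is a hypothetical stand-in):

```python
# Sketch of fallback: try the local recognizer first, only hand the sentence
# to the LLM when nothing matches. In HA this is a pipeline setting, not code
# you write yourself; every function here is a hypothetical stand-in.

def match_intent(sentence: str):
    """Stand-in for the built-in template matcher (see the sketch above)."""
    return None, {}  # pretend nothing matched

def execute_intent(intent: str, slots: dict) -> str:
    return f"Executed {intent} with {slots}"

def ask_llm(sentence: str) -> str:
    """Stand-in for a conversation agent backed by an LLM."""
    return f"(LLM handles: {sentence!r})"

def handle(sentence: str) -> str:
    intent, slots = match_intent(sentence)
    if intent is not None:
        return execute_intent(intent, slots)
    return ask_llm(sentence)  # fallback: chuck it at the LLM

print(handle("make the kitchen less gloomy"))
```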
If you either chose to skip local recognition or fell through to fallback, your request is now with the LLM. This is where the LLM has a distinct advantage. It does pattern recognition on what was fed in and tries to find a 'tool' (those same scripts and intents local recognition sees) and match up a call. (This is why, if you choose to use an LLM, you should choose one optimized for 'tool use'; see also why I don't understand any LLMs attached to HA that can't 'operate HA.' I also recommend a straight LLM without chain of thought or reasoning at this moment [GPT-4o-mini, not o1, etc.], but that's another post.) It does NOT need to be an exact string match; the LLM can figure out what you meant. LLMs are EXCEEDINGLY good at this part. So you often get WAY better recognition of phrases and 'what I meant', IF you do a good job of describing the conditions and rules to the AI. More on that in another post.
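Mechanically, 'tool use' looks something like this stripped-down sketch with the OpenAI Python SDK and a single simplified HassTurnOn-style tool (HA's conversation integration builds this request for you, with your exposed entities and prompt included; the schema below is made up):

```python
# Stripped-down sketch of tool use: the LLM is shown a list of tools (HA's
# intents/scripts) and asked to pick one and fill in the arguments. The tool
# schema here is simplified/made up; HA builds the real request for you.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "HassTurnOn",
        "description": "Turn on a device or entity in Home Assistant.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Friendly name of the entity"},
            },
            "required": ["name"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "it's way too dark in the kitchen"}],
    tools=tools,
)

call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
# e.g. HassTurnOn {"name": "kitchen lights"}  <- no exact phrase needed, it inferred intent
```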
Builtin - pattern match only; lightweight and runs on anything you can run HA on. Read: gets the job done but is very, VERY unforgiving.
LLMs - WAY better pattern matching because they can infer intent, but with heavier requirements: either a service (ChatGPT, etc.) or setting up your own local LLM infrastructure (Ollama + Open WebUI + OpenRouter is common). Read: time, energy, and money for better results.
If the LLM finds a match, it assembles the correct JSON to set off the intent and hands it back to HA's intent handling as a tool call. You're essentially back in the same place in the pipeline you would have been with a local-only call, but it did a better job at matching. HA does its thing and then sends back a response.
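What comes back from the LLM at that point is just structured data; here's a hypothetical example of the hand-off (intent name, arguments, and response shape are all simplified):

```python
# Hypothetical illustration of the hand-off: the LLM's tool call is plain
# JSON, which lands in the same intent machinery a local match would have
# hit. Names, arguments, and the response shape are simplified examples.
import json

tool_call_arguments = '{"name": "kitchen lights"}'  # produced by the LLM
slots = json.loads(tool_call_arguments)

# Conceptually, HA now does exactly what it does for a local match:
intent_response = {
    "intent": "HassTurnOn",
    "slots": slots,
    "speech": "Turned on the kitchen lights",  # the verbatim response text
}
print(intent_response["speech"])
```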
In LLM land, that response is interpreted and sent on to TTS. It will be a summary of the response (and this is where things get creative), whereas with no LLM you get the verbatim response. Your LLM's impact is felt here the most, because it has control of what's ultimately fed to the voice output or your readout… It's also able to use any data in the conversation pipeline (think continuation) or the initial prompt.
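Continuing the earlier OpenAI sketch, that last LLM step looks roughly like this: the tool result is appended to the conversation and the model writes the wording that actually gets spoken (IDs and values below are illustrative):

```python
# Sketch of the LLM rewriting the response before it is spoken: the tool
# result goes back into the conversation and the model produces the final
# wording. IDs and values are illustrative.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "user", "content": "it's way too dark in the kitchen"},
    {"role": "assistant", "content": None, "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "HassTurnOn", "arguments": '{"name": "kitchen lights"}'},
    }]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": '{"success": true, "speech": "Turned on the kitchen lights"}'},
]

final = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(final.choices[0].message.content)
# e.g. "Done, the kitchen lights are on." (the LLM picks the wording)
```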
That response is fed to your TTS engine (probably Piper or the cloud TTS engine) and back to you.
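For completeness, here's the last hop sketched with the Piper CLI (HA's Piper add-on speaks the Wyoming protocol rather than being shelled out to like this; the voice model filename is just an example):

```python
# Sketch of the text-to-speech hop using the Piper CLI. The HA add-on talks
# the Wyoming protocol instead of being invoked like this; the voice model
# filename is only an example.
import subprocess

reply = "Turned on the kitchen lights"
subprocess.run(
    ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "reply.wav"],
    input=reply.encode(),
    check=True,
)
# reply.wav now contains the audio that gets played back to you
```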
Flow is basically:
STT/Text > recognizer (local/LLM) > tool > response > (LLM option) > TTS/Text