Today I rebuilt my Home Assistant voice pipeline to avoid sending the full entity list to an LLM on every request.
With ~1k entities and ~400 automations, the default LLM-based approach becomes slow and token-heavy. Even with local models (Ollama), each prompt includes the full context dump and relies entirely on probabilistic intent matching.
I wanted deterministic execution for common commands, with the LLM as fallback — not the primary brain.
Architecture
- pgvector tier (deterministic layer)
All entities, scripts, and automations are embedded into a Postgres + pgvector database.
Incoming intent is matched against embeddings to find the closest entity + action pair.
No LLM involved at this stage.
- Intent splitter tier (small local LLM, ~1B)
Handles compound commands like:
“Turn off the bed lamp and living room lights, then restart the coffee switch.”
Its only job:
- Split commands
- Classify HA actions
- Detect sequencing (parallel vs sequential)
- Separate non-HA requests
It never sees the full entity list.
- Non-HA routing tier
If part of the request falls outside HA, it gets routed to a smarter model for web search / reasoning.
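To make the deterministic tier concrete: the real lookup would be a pgvector query (something like `ORDER BY embedding <=> $1 LIMIT 1`, using pgvector's cosine-distance operator), but the matching logic reduces to nearest-neighbour search with a confidence threshold. Here's a minimal in-memory sketch — entity names, vector sizes, and the 0.80 threshold are my own illustration, not the author's values:

```cpp
#include <cmath>
#include <string>
#include <vector>

// One embedded Home Assistant target (entity, script, or automation).
// Vectors are tiny here for illustration; real embeddings are much larger.
struct Target {
    std::string entity_id;
    std::vector<float> embedding;
};

// Cosine similarity between two equal-length vectors.
inline float cosine(const std::vector<float>& a, const std::vector<float>& b) {
    float dot = 0.f, na = 0.f, nb = 0.f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-9f);
}

// Return the best-matching entity_id, or an empty string if nothing clears
// the threshold — at which point the request falls through to the LLM.
inline std::string match_intent(const std::vector<float>& query,
                                const std::vector<Target>& targets,
                                float threshold = 0.80f) {
    std::string best;
    float best_sim = threshold;
    for (const auto& t : targets) {
        float s = cosine(query, t.embedding);
        if (s > best_sim) { best_sim = s; best = t.entity_id; }
    }
    return best;
}
```

The threshold is what keeps this tier deterministic: a low-confidence match never fires a wrong action, it just escalates.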
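The splitter tier's four jobs can be pinned down as a small output contract. This is a hypothetical sketch of the structure such a tier might return — the field names are illustrative, not the author's actual schema:

```cpp
#include <string>
#include <vector>

// Hypothetical output contract for the ~1B splitter model: structured HA
// actions plus any leftover non-HA text. It never carries entity lists —
// targets stay free-text until the pgvector tier resolves them.
struct SplitAction {
    std::string action;    // classified HA action, e.g. "turn_off"
    std::string target;    // free-text target, e.g. "bed lamp"
    bool starts_new_wave;  // sequencing hint, e.g. from "then ..."
};

struct SplitResult {
    std::vector<SplitAction> ha;      // resolved downstream via pgvector
    std::vector<std::string> non_ha;  // routed to the smarter model
};
```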
Execution Model
- Independent commands execute in parallel
- Dependent commands use a 500ms debounce “wave” logic
- Non-HA requests are returned in a non_ha field
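A minimal sketch of the wave logic, assuming the splitter tags each command with a sequencing flag (the 500ms debounce window that closes a wave is omitted here; struct and field names are my own, not the author's):

```cpp
#include <string>
#include <vector>

// One action produced by the splitter tier.
struct Command {
    std::string service;    // e.g. "light.turn_off"
    std::string entity_id;  // resolved by the pgvector tier
    bool sequential;        // true: must wait for the previous wave
};

// Partition commands into "waves": commands within a wave execute in
// parallel; a command flagged sequential opens a new wave. In the real
// pipeline a 500ms debounce also closes the current wave.
inline std::vector<std::vector<Command>>
build_waves(const std::vector<Command>& cmds) {
    std::vector<std::vector<Command>> waves;
    for (const auto& c : cmds) {
        if (waves.empty() || c.sequential) waves.emplace_back();
        waves.back().push_back(c);
    }
    return waves;
}
```

For the compound example above, the two light commands land in wave one and the coffee-switch restart opens wave two.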
Results
- ~95% of HA commands complete in under 500ms
- 6MB C++ binary
- Runs comfortably on a Raspberry Pi 5
- LLM is fallback, not dependency
Next step is packaging this as a Docker image and wiring it as the brain behind a Wyoming-based voice satellite (ReSpeaker Lite).
Happy to share more implementation details if there’s interest.