Embedding-Based Voice Assistant Pipeline

Today I rebuilt my Home Assistant voice pipeline to avoid sending the full entity list to an LLM on every request.

With ~1k entities and ~400 automations, the default LLM-based approach becomes slow and token-heavy. Even with local models (Ollama), each prompt includes the full context dump and relies entirely on probabilistic intent matching.

I wanted deterministic execution for common commands, with the LLM as fallback — not the primary brain.

Architecture

  1. pgvector tier (deterministic layer)
    All entities, scripts, and automations are embedded into a Postgres + pgvector database.
    Incoming intent is matched against embeddings to find the closest entity + action pair.
    No LLM involved at this stage.
  2. Intent splitter tier (small local LLM, ~1B)
    Handles compound commands like:

“Turn off the bed lamp and living room lights, then restart the coffee switch.”
Its only job:

  • Split commands
  • Classify HA actions
  • Detect sequencing (parallel vs sequential)
  • Separate non-HA requests

It never sees the full entity list.
  3. Non-HA routing tier
    If part of the request falls outside HA, it gets routed to a smarter model for web search / reasoning.
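The deterministic tier is essentially a nearest-neighbour lookup: pgvector's cosine-distance operator (`SELECT entity_id FROM entities ORDER BY embedding <=> $1 LIMIT 1`) does this server-side. As a minimal sketch of the same computation in-process (entity IDs, vectors, and the distance threshold below are illustrative, not my actual values):

```cpp
#include <cmath>
#include <string>
#include <vector>

// One row of the embedding table: an HA entity and the embedding of its
// name/aliases. Real vectors come from an embedding model, not 3 floats.
struct EntityEmbedding {
    std::string entity_id;   // e.g. "light.bed_lamp" (illustrative)
    std::vector<float> vec;
};

// Cosine distance, i.e. what pgvector's `<=>` operator computes.
float cosine_distance(const std::vector<float>& a, const std::vector<float>& b) {
    float dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return 1.0f - dot / (std::sqrt(na) * std::sqrt(nb));
}

// Return the closest entity, or "" if nothing clears the threshold —
// that miss is what triggers the LLM fallback path.
std::string match_entity(const std::vector<EntityEmbedding>& db,
                         const std::vector<float>& query,
                         float max_distance = 0.35f) {
    std::string best;
    float best_d = max_distance;
    for (const auto& e : db) {
        float d = cosine_distance(e.vec, query);
        if (d < best_d) { best_d = d; best = e.entity_id; }
    }
    return best;
}
```

The threshold matters: a confident match executes deterministically, an ambiguous one falls through to the LLM instead of guessing.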

Execution Model

  • Independent commands execute in parallel
  • Dependent commands use a 500ms debounce “wave” logic
  • Non-HA requests are returned in a non_ha field
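The wave logic can be sketched roughly like this, assuming the splitter tags each HA command as either parallel with its predecessor or sequenced after it ("then ..."); the struct and field names here are hypothetical:

```cpp
#include <string>
#include <vector>

// One HA command as emitted by the intent splitter. Non-HA fragments are
// routed separately (returned in a non_ha field) and never reach this stage.
struct Command {
    std::string action;   // e.g. "light.turn_off light.bed_lamp" (illustrative)
    bool after_previous;  // true => sequenced after the prior wave ("then ...")
};

// Partition commands into waves: everything within a wave executes in
// parallel; waves run one after another. In the real pipeline a 500ms
// debounce timer also closes the current wave once input goes quiet.
std::vector<std::vector<std::string>> build_waves(const std::vector<Command>& cmds) {
    std::vector<std::vector<std::string>> waves;
    for (const auto& c : cmds) {
        if (waves.empty() || c.after_previous) waves.emplace_back();
        waves.back().push_back(c.action);
    }
    return waves;
}
```

For the compound example above, the two light commands land in wave 1 (parallel) and the coffee-switch restart in wave 2, which only starts once wave 1 finishes.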

Results

  • ~95% of HA commands execute under 500ms
  • 6MB C++ binary
  • Runs comfortably on a Raspberry Pi 5
  • LLM is fallback, not dependency

The next step is packaging this as a Docker image and wiring it up as the brain behind a Wyoming-based voice satellite (ReSpeaker Lite).

Happy to share more implementation details if there’s interest.
