The vision
- I don't want a voice assistant. I want to talk to my house, and have the house be a frontier, vanguard LLM that reads my home's context and acts on it.
- There is no "wake word". Voice input wakes up with an active human interaction - in our case a tap on the "ball" (see bottom for images of the ball).
- Assist is just the only tool I have at hand to get audio in and out. The brain is the point; Assist is the pipe.
The architecture (all local except the LLM brain)
[ball] ESP32-S3 XiaoZhi, tap-to-talk
│
▼
[HA Assist] pipeline glue
│
▼
[Whisper · STT] whisper.cpp large-v3-turbo, Metal, Greek
│
▼
[casa-llm · ROUTER] classifier → picks the route
│
▼
[casa-llm · EXEC] agentic tool-loop → calls HA services
│
▼
[Piper · TTS] Wyoming Piper, Greek voice
│
▼
[ball] speaks the reply
Casa LLM
My own dockerised dispatcher. A custom HA conversation agent (conversation.casa) POSTs every transcript to it. Its architecture in brief:
- Incoming message lands in the router: a Gemini 2.5 Flash instance whose only job is to classify the inbound message and route it to one of three roles:
a. system. The house brain, full authority.
- Model: Claude (Sonnet 4.6 on voice, for latency).
- Does: system-specific work. It can brainstorm around the home automation system, plan and implement changes, effectively improving the automation system.
b. ha. Controls the house.
- Model: Gemini 2.5 Flash, run as an agentic tool-loop.
- Does: the actual "turn the lights off / is the CO₂ high?" work.
c. chat. The escape hatch for everything else.
- Model: OpenAI Codex (gpt-5-codex).
- Does: open-ended Q&A and anything that isn't house control or config.
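To make the classify-then-route idea concrete, here is a minimal sketch in Python. It is not the actual casa-llm code: `dispatch`, `classify_stub` and the `HANDLERS` table are hypothetical names, the keyword classifier stands in for the Gemini 2.5 Flash router, and falling back to `chat` on an unknown label is an assumption based on chat being the escape hatch.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Reply:
    route: str  # which role handled the transcript
    text: str   # what gets sent on to TTS


def dispatch(
    transcript: str,
    classify: Callable[[str], str],
    handlers: Dict[str, Callable[[str], str]],
) -> Reply:
    """Classify the transcript, then hand it to the matching role."""
    route = classify(transcript).strip().lower()
    if route not in handlers:
        # Assumption: anything the router cannot place goes to "chat".
        route = "chat"
    return Reply(route=route, text=handlers[route](transcript))


def classify_stub(text: str) -> str:
    """Keyword stand-in for the LLM router (illustration only)."""
    lowered = text.lower()
    if "automation" in lowered:
        return "system"
    if "light" in lowered or "co2" in lowered:
        return "ha"
    return "chat"


# Stand-ins for the three roles; the real ones would call an LLM.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "system": lambda t: f"[system] planning: {t}",
    "ha": lambda t: f"[ha] executing: {t}",
    "chat": lambda t: f"[chat] answering: {t}",
}

if __name__ == "__main__":
    print(dispatch("turn the lights off", classify_stub, HANDLERS))
```

Keeping the classifier and the handlers as plain callables means each role's model can be swapped without touching the dispatch logic.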
What I'm actually chasing: real push-to-talk
- The UX I want: tap once → it listens for as long as I talk, pauses and all → tap again to stop. No wake word, no press-and-hold.
- I know this isn't how Assist is meant to work. Assist wants end-of-speech detection. I genuinely don't care. I care about the experience I'm building, not the tool's intent.
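The behaviour described above boils down to a two-state toggle where only a tap can end the turn. A minimal sketch in Python, purely to pin down the semantics: `TapToTalk`, `tap` and `feed` are hypothetical names, and this is neither ESPHome nor Assist code.

```python
from enum import Enum
from typing import List, Optional


class State(Enum):
    IDLE = "idle"
    LISTENING = "listening"


class TapToTalk:
    """First tap starts capture, second tap ends it. Silence never does."""

    def __init__(self) -> None:
        self.state = State.IDLE
        self._chunks: List[bytes] = []

    def tap(self) -> Optional[bytes]:
        """Toggle the state; return the captured audio when a turn ends."""
        if self.state is State.IDLE:
            self._chunks = []
            self.state = State.LISTENING
            return None
        self.state = State.IDLE
        return b"".join(self._chunks)

    def feed(self, chunk: bytes) -> None:
        """Buffer audio while listening; pauses are just more chunks."""
        if self.state is State.LISTENING:
            self._chunks.append(chunk)


if __name__ == "__main__":
    ball = TapToTalk()
    ball.tap()              # start listening
    ball.feed(b"hello ")
    ball.feed(b"")          # a long pause changes nothing
    ball.feed(b"house")
    print(ball.tap())       # second tap ends the turn
```

The point of the sketch is what is missing: there is no timer and no silence check anywhere, so the length of a pause cannot affect when the turn ends.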
What I tried - and excluded
- ESPHome voice_assistant.start(silence_detection: false) as a single-tap toggle, perfect on paper. The wall: HA's Assist pipeline runs its own server-side STT VAD (assist_pipeline/vad.py) and still ends the turn on a pause.
- Press-and-hold. Ruled out, hard. Not the UX I want.
- "Just make the VAD longer." Core VAD isn't cleanly per-pipeline tunable, and it's known to be flaky.
- I even rolled my own hack on HA Core to turn off VAD, but the ESPHome voice assistant's state machine broke down with that.
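For anyone unsure why a pause kills the turn, here is a toy silence-timeout detector in Python. It is not the code in assist_pipeline/vad.py, just a sketch of the general end-of-speech technique; `end_of_speech_index` and `silence_limit` are hypothetical names, and frames are simplified to booleans (True means speech was detected in that frame).

```python
from typing import Optional, Sequence


def end_of_speech_index(frames: Sequence[bool], silence_limit: int) -> Optional[int]:
    """Return the index of the frame where the turn would be cut, or None.

    The turn ends once `silence_limit` consecutive silent frames follow
    any speech. Silence before the first speech frame is ignored.
    """
    seen_speech = False
    silent_run = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            seen_speech = True
            silent_run = 0
        elif seen_speech:
            silent_run += 1
            if silent_run >= silence_limit:
                return i
    return None


if __name__ == "__main__":
    # Speech, a three-frame pause to think, then more speech.
    frames = [True, True, False, False, False, True, True]
    print(end_of_speech_index(frames, silence_limit=3))  # 4: cut mid-thought
    print(end_of_speech_index(frames, silence_limit=4))  # None: pause survives
```

Raising the limit only moves the cliff: any pause longer than the limit still ends the turn, which is why a longer timeout is not the same thing as tap-to-stop.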
The ask
Anyone else in the same boat as me?
Your thoughts on the topic are very welcome.
(Photo of the ball)
