Basically, STT quality has kept me from switching to HomeAssistant's voice assistant features. The default matcher (Hassil) is waaaaaaay too strict, and LLMs are slow, costly, and/or a privacy nightmare, plus I don't like them.
I have posted about this on here before, and the situation has only slightly improved since then.
I really thought there would be something available that just matches your STT output to the configured intents, but apparently not, so I've built it myself.
Finally convinced my GF to throw Alexa in the bin.
Here's an excerpt from the README, and feel free to AMA:
Problem statement and solution
Speech-To-Text (STT) output, especially fast and local STT output, is often simply bad.
HomeAssistant's own Hassil is incredibly picky:
your STT output must match exactly to one of the configured intents.
There are two paths forward from this: upgrade your hardware to support better STT, or
try to figure out what the speaker probably meant to say from the garbled output.
This project does the latter.
With this custom integration, "Lights on in live in room" will actually turn on the lights in your living room.
So will, for that matter, "lighrts on inn livainriomm".
Short demo, first with closest-intent, then with bare Hassil:
Highlights
- Pattern expansion. Expanding <expansion_rules>, (alternatives|to), and [optional|alternatives] all work, including on HASS-defined lists like your home's areas and entities!
- Slot extraction. Both for wildcard slots (like for adding something to the shopping list, where the {item} is a wildcard), and against slots like {timer_hours:hours} with a fixed set of possibilities.
- Fuzzy slot resolution. For list-like slots and expansion rules (including your areas and entities!), fuzzy match the slot values to the available options. Allows "livikroom" to be corrected to "living room".
- Actual intent handling still done by Hassil. closest-intent simply corrects your STT output or typos to the closest matching intent, and then forwards a nice, canonical sentence to Hassil, who then deals with the intent just like if you had spoken/typed perfectly.
- 100% LLM-free. Just uses relatively simple fuzzy matching of the input against your intents, plus some clever-ish (well... working, at least) tricks to improve the results.
- Fallback agent support. OK, I said 100% LLM-free, but if you absolutely want to, you can use one as fallback. More on this below.
- Is fast (as in: basically instant for a couple hundred configured custom intents).

Note: closest-intent is completely language-agnostic. All the examples in this README are in English, but you can use it with any language you like; personally, I use it in German.
Examples
Here are some examples of things I said, what my STT (wyoming-faster-whisper-base) understood, what HomeAssistant was able to do/answer after passing the STT output through closest-intent, and what the same STT output would have resulted in with just bare Hassil.
Note: These are actual results I got when speaking the "what was said" sentences into my phone.
I'm a native German speaker, and so I do have an accent, but this pretty closely matches my experience when using the German-language version of whisper.
The "bare Hassil" responses are what I got after 1:1 pasting the STT output into the voice assist chat window with closest-intent disabled.
| what was said | STT output | with Closest Intent | bare Hassil |
|---|---|---|---|
| start cleaning | Star cleaning. | | |
| stop cleaning | Stop clenching! | | |
| vacuum the living room | Vacuum Believing Room | | |
| clean the office | King the Office | | |
| vacuum the kitchen | Back here in the kitchen. | | |
| how warm is it in the bedroom | Our all is in the best room. | | |
| add milk to the shopping list | Add milk to the chauvinist. | | |
| put call dentist on my todo list | put call dentist on my tudu list | | |
| turn on the water pump | turn on the what her pump | | |
| play some music | Place on music | | |
| resume the music | Renew Music | | |
| pause the music | Post music | | |
| next track | next rack | | |
| enable shuffle | an able shuffling | | |
| disable shuffle | Disable to schaffen. | | |
| restart the player | Reset the plan. | | |
| play a random album | Player random album | | |
| play a random artist | Player and Immartist. | | |
| play the latest tracks | Plan the ladder tracks. | | |
| play recently played songs | Player recently played so... | | |
| play playlist NieR | Play playlist NEAR! | | |
| play my daily briefing | and play my daily breathing | | |
| what time is it | What the hell is it? | | |
| what day is it today | One day is today. | | |
| make the tv brighter | Make that CV brighter. | | |
| set the screen darker | The screen doctor. | | |
| what's the weather today | What's the matter with you? | | |
| how's the weather tomorrow morning | How's the better tomorrow? | | |
| what's the weather this week | What's the matter this weak | | |
| how's the weather at 5 o'clock | cast the red there at 5 o'clock | | |
| how windy is it right now | how windy is IR low | | |
| how windy will it be tonight | How will you be tonight? | | |
| how hot will it get today | How hard will it get today? | | |
| will it rain today | with it right today | | |

...you get the idea.
How it works
closest-intent is registered in HomeAssistant as a conversation agent.
On startup, it parses (by default) all user-defined intents (or, optionally, also the built-in ones). In this process, it also expands all rules, like <expansion_rule>, (alternatives|to), and [optionals], and notes where {slots} are located, and whether they are wildcards or belong to some list (like areas, entities, or the numbers 1-100).
When a user request comes in (via voice command or the chat box), closest-intent fuzzy-matches that request against those expanded rules.
If the rule does not contain a slot, it is picked immediately.
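
The slot-free case can be sketched in a few lines. Again, this is only an illustration under assumptions: it scores candidates with `difflib.SequenceMatcher` from the standard library, which is not necessarily the similarity measure the integration uses, and `normalize` and `closest_sentence` are made-up helper names.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so 'Star cleaning.' compares fairly."""
    kept = (ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return "".join(kept).strip()


def closest_sentence(request: str, sentences: list[str]) -> tuple[str, float]:
    """Return the best-scoring candidate and its similarity in [0, 1]."""
    query = normalize(request)
    scored = [
        (SequenceMatcher(None, query, normalize(sentence)).ratio(), sentence)
        for sentence in sentences
    ]
    score, best = max(scored)
    return best, score


intents = ["start cleaning", "stop cleaning", "play some music"]
print(closest_sentence("Star cleaning.", intents))
print(closest_sentence("Place on music", intents))
```

Both garbled inputs from the table above land on the intended sentence here, the first with a much higher score than the second.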
If it does contain a slot, closest-intent performs a sequence of fancy magic steps to find the best-fitting slot value among a range of possible positions within the top-scoring matched sentences.
In practice, this often means "smallest slot-value on a word-boundary", but the extraction is not limited to that.
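
One simple way to picture the wildcard case: try every contiguous run of words as the slot value and keep the one whose surrounding words best fit the fixed parts of the template. The sketch below does exactly that and nothing more. It is a simplified stand-in for the "fancy magic steps", not a description of them; `extract_wildcard` is a hypothetical name, and it assumes a template with a single slot.

```python
from difflib import SequenceMatcher


def extract_wildcard(request: str, template: str, slot: str = "{item}") -> tuple[str, float]:
    """Try every contiguous word span as the slot value; keep the best fit."""
    words = request.lower().strip(".!? ").split()
    # Fixed text before and after the slot, e.g. "add" / "to the shopping list".
    prefix, suffix = (part.strip() for part in template.lower().split(slot))
    best_value, best_score = "", -1.0
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            before = " ".join(words[:start])
            after = " ".join(words[end:])
            score = (
                SequenceMatcher(None, before, prefix).ratio()
                + SequenceMatcher(None, after, suffix).ratio()
            ) / 2
            # Strict '>' keeps the earlier, shorter span when scores tie.
            if score > best_score:
                best_value, best_score = " ".join(words[start:end]), score
    return best_value, best_score


value, score = extract_wildcard("Add milk to the chauvinist.", "add {item} to the shopping list")
print(value, round(score, 2))
```

For the shopping-list example from the table, the extracted value is `milk`: taking any more words into the slot makes the remainder fit "to the shopping list" worse.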
With the best match found, we then reconstruct the "canonical form", i.e. a sentence that Hassil will actually understand.
If in your configured intents, "Play some music." exists, and closest-intent got "Place on music" and matched that to the intent,
it will simply forward "Play some music." to Hassil. If the intent contained a slot, the extracted value will be substituted.
This guarantees that the sentence passed to Hassil will actually be understood, and allows us to not have to worry at all about performing actions, running scripts, and so on.
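
Rebuilding the canonical form is essentially a substitution. A minimal sketch, assuming slot values are keyed by the slot name (the part after the colon in {timer_hours:hours}, or the whole name in {item}); `canonical_sentence` is a hypothetical helper, not a function from the project:

```python
import re

# Matches {item} as well as {list_name:slot_name}; group 1 is the slot name.
SLOT = re.compile(r"\{(?:\w+:)?(\w+)\}")


def canonical_sentence(template: str, slots: dict[str, str]) -> str:
    """Fill the extracted slot values into the matched intent sentence."""
    return SLOT.sub(lambda match: slots[match.group(1)], template)


print(canonical_sentence("add {item} to the shopping list", {"item": "milk"}))
print(canonical_sentence("set a timer for {timer_hours:hours} hours", {"hours": "2"}))
```

The first call yields `add milk to the shopping list`, which is a sentence Hassil can match exactly.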
If no matching intent could be found, we pass the exact input we got to the configured fallback agent.
By default, that is simply Hassil (which again allows us to be lazy and not worry about proper error responses), but it can also be another agent, like an LLM.
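
The hand-off could look roughly like the sketch below. How the integration actually decides that "no matching intent could be found" is not spelled out here, so the similarity `threshold` is purely an assumption, as are the `route` function and the two agent callables, which stand in for Hassil and the configured fallback agent.

```python
from difflib import SequenceMatcher
from typing import Callable

Agent = Callable[[str], str]  # takes a sentence, returns a response


def route(
    request: str,
    sentences: list[str],
    hassil: Agent,
    fallback: Agent,
    threshold: float = 0.6,  # assumed cut-off, not a documented default
) -> str:
    """Forward the canonical sentence on a good match, else the raw input."""
    query = request.lower().strip(".!? ")
    score, best = max(
        (SequenceMatcher(None, query, sentence.lower()).ratio(), sentence)
        for sentence in sentences
    )
    if score >= threshold:
        return hassil(best)  # Hassil gets a sentence it is known to understand
    return fallback(request)  # the fallback gets exactly what the user said


intents = ["start cleaning", "play some music"]
print(route("Star cleaning.", intents, lambda s: f"hassil: {s}", lambda s: f"fallback: {s}"))
print(route("tell me a joke", intents, lambda s: f"hassil: {s}", lambda s: f"fallback: {s}"))
```

The first request is close enough to be forwarded as `start cleaning`; the second matches nothing well and goes to the fallback unchanged.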

