Improve robustness of intent handling by considering "closeness" to defined sentences

Note: I am sorry if this has already been suggested, discussed, and dismissed in the past, but I could not find a discussion of this either here or in the GitHub issues.

Motivation

Home Assistant’s Voice Assistant functionality is fantastic, but it has a fatal flaw when used in conjunction with a Speech-To-Text engine: unless the STT process is flawless 100% of the time, frustration occurs when the dreaded “Sorry, I didn’t understand that” response comes back for a request that is, on the surface, well-formed.

For a concrete example: I have a List called “To Do”.

If I tell HA (in writing or through speech/STT) to

  • “add Vacuuming to my To Do list”

it does just that. However, if I tell it to

  • “add Vacuuming to my To-Do list”
  • “add Vacuuming to my To-do list”
  • “add Vacuuming to my ToDo list”
  • “add Vacuuming to my Todo list”
  • “add Vacuuming to my To Do-list”
  • “add Vacuuming to my To-Do-list”
  • “add Vacuuming to my Todolist”
  • “add Vacuuming to my to do list”

it does not. I would argue that even when typing a sentence like this, I can’t always be expected to remember the exact name of everything, but that is kinda beside the point.

The annoying thing is that I am at the mercy of the STT engine.

It is damn-near impossible to influence whether it will interpret | tu du: | as “To Do”, “To-Do”, “ToDo”, or even “To do”! All of the examples listed above came out of playing around with an STT engine for about 30 minutes. In that time, I had maybe a 5% success rate at getting HA to put something on my To-Do, erm, To Do list.

LLMs are not the solution

I know that HA has just received an update allowing fallback-handling of failed voice commands by an LLM. That’s great, but in many cases not an option (even just considering the hardware requirements!). Additionally (and I don’t know if there’s an English idiom for this, so I will use a German one) this is like “using cannons to shoot sparrows”. It’s just overkill.

Proposed solution

Add an option similar to the “fallback to an LLM” option, but for a simple distance-based measure. For example, the well-known Levenshtein distance could be used to measure not if the input matches a sentence in the predefined list of sentences, but how closely it matches.

If no exact match is found, pick the closest match (with a configurable cutoff). This would make controlling HA through voice significantly more robust.
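
To make the idea concrete, here is a minimal sketch in Python. It assumes the known sentences are available as a flat list of strings, which may not be how HA represents them internally. The names `levenshtein`, `closest_sentence`, and `max_distance` are made up for illustration and are not part of any HA API.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows in memory."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def closest_sentence(
    text: str,
    sentences: Iterable[str],
    max_distance: int = 3,  # hypothetical, user-configurable cutoff
) -> Optional[str]:
    """Return the known sentence closest to `text`, or None if none is within the cutoff."""
    best = None
    best_distance = max_distance + 1
    query = text.lower()
    for sentence in sentences:
        distance = levenshtein(query, sentence.lower())
        if distance < best_distance:
            best, best_distance = sentence, distance
    return best


known = ["add Vacuuming to my To Do list", "turn on the kitchen light"]
print(closest_sentence("add Vacuuming to my To-Do list", known))
```

With a cutoff of 3, the hyphenated variant is one edit away from the defined sentence and is accepted, while an unrelated request still falls through to the usual error response.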

Issues with this approach

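Levenshtein is not exactly fast: the standard algorithm takes time proportional to the product of the two string lengths, and it would have to be run against every candidate sentence. I don’t know if this is an issue in practice, but it very well might be. If it is, it might help to pre-compute and/or cache non-perfect matches automatically where possible.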
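
As a rough sketch of what such caching could look like, the snippet below memoizes the result for each input text with Python's `functools.lru_cache` and skips candidates whose length difference alone already rules them out. `SENTENCES`, `MAX_DISTANCE`, and `resolve` are hypothetical names, and the flat tuple of lowercase sentences is an assumption, not how HA stores them.

```python
from functools import lru_cache
from typing import Optional

# Hypothetical flat collection of known sentences (already lowercased).
SENTENCES = (
    "add vacuuming to my to do list",
    "turn on the kitchen light",
)
MAX_DISTANCE = 3  # hypothetical cutoff


def levenshtein(a: str, b: str) -> int:
    """Two-row dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,
                current[j - 1] + 1,
                previous[j - 1] + (char_a != char_b),
            ))
        previous = current
    return previous[-1]


@lru_cache(maxsize=1024)
def resolve(text: str) -> Optional[str]:
    """Closest known sentence within the cutoff, or None. Memoized per input text."""
    best, best_distance = None, MAX_DISTANCE + 1
    for sentence in SENTENCES:
        # The edit distance is never smaller than the length difference,
        # so such candidates cannot beat the current best.
        if abs(len(text) - len(sentence)) >= best_distance:
            continue
        distance = levenshtein(text, sentence)
        if distance < best_distance:
            best, best_distance = sentence, distance
    return best


resolve("add vacuuming to my to-do list")  # computed once
resolve("add vacuuming to my to-do list")  # served from the cache
```

An STT engine tends to repeat the same mis-transcriptions, so even a small cache like this would likely absorb most of the cost after the first occurrence. The cache would need to be cleared (`resolve.cache_clear()`) whenever the set of sentences changes.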

Alternatives

At the very, very least, if the converted-to-lowercase input with all spaces stripped matches a converted-to-lowercase, all-spaces-stripped sentence, that should count as a successful match.
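
A minimal sketch of that comparison, with made-up names (`normalize`, `loosely_equal`, `strip_chars`) and the assumption that input and sentence are both plain strings:

```python
def normalize(text: str, strip_chars: str = " ") -> str:
    """Lowercase the text and drop every character listed in strip_chars."""
    return "".join(ch for ch in text.lower() if ch not in strip_chars)


def loosely_equal(a: str, b: str, strip_chars: str = " ") -> bool:
    """True if both strings are identical after normalization."""
    return normalize(a, strip_chars) == normalize(b, strip_chars)


defined = "add Vacuuming to my To Do list"
print(loosely_equal("add Vacuuming to my ToDo list", defined))
print(loosely_equal("add Vacuuming to my To-Do list", defined, strip_chars=" -"))
```

Stripping only spaces covers variants like “ToDo”, “Todolist”, and “to do list”. The hyphenated variants from my list above would additionally need the hyphen stripped, which is why `strip_chars` is left configurable here.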