@marisa - I’m using BigBobbas’ modified ESPHome code:-
It massively improves on the basic S3-BOX-3 code, supports a timer and exposes a media player.
Like @dza though, I find that the box3 often locks up if it doesn’t quite understand what you are saying.
On-device wake word detection is pretty good, just not quite there yet to be usable by the rest of the family. You have to learn how to speak to it for it to be reliable.
I found that pushing beam size to 5 made a huge difference, but I’m just guessing at the settings.
My latest issue is that my ESP32-Box has stopped responding to the wake word. I re-flashed it with the stock code from the ESPHome projects page to see if it would help, but it made no difference.
On the ESPHome settings page the wake word location selector is greyed out as ‘unavailable’, which seems like the only hint as to what is going on.
I’ve been trying out a basic ESP32-S3 with INFP mic setup, and the biggest frustration I have right now is the STT errors. I’m curious if there’s anything out there like a simplified model that only recognizes the subset of words associated with a home automation setup? For example, I ask Jarvis to turn on the living room lamp, and it thinks I said “Turn on the learning room light”. The word “learning” will NEVER be used when I am voice controlling things around the house, so I’d love to not even have that as an option for the Whisper model to choose from.
Of course, with a general LLM, you need the full breadth if you’re going to ask it random questions, but that could be its own pipeline.
*edit: Vosk STT seems like a pretty good option for limiting what can be recognized.
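
For anyone wanting to try that idea outside Home Assistant, here is a rough sketch of how Vosk's Python bindings can be restricted to a fixed phrase list by passing a JSON grammar to KaldiRecognizer. The command list, the function names, and the WAV/model paths are made up for illustration. As far as I know, grammar restriction only works with the small Vosk models that support dynamic vocabulary, so treat this as a starting point rather than a tested setup.

```python
import json
import wave

# Hypothetical command list; replace with the phrases your own setup uses.
COMMANDS = [
    "turn on the living room lamp",
    "turn off the living room lamp",
]


def build_grammar(phrases):
    """Return the JSON string Vosk takes as a grammar.

    "[unk]" gives the recognizer a bucket for anything outside the list.
    """
    return json.dumps(list(phrases) + ["[unk]"])


def transcribe(wav_path, model_dir, phrases=COMMANDS):
    """Recognize a 16-bit mono PCM WAV file using only the given phrases."""
    # Imported here so build_grammar() can be used without Vosk installed.
    from vosk import Model, KaldiRecognizer

    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate(),
                              build_grammar(phrases))
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            rec.AcceptWaveform(data)
    # FinalResult() returns a JSON string; the transcript is under "text".
    return json.loads(rec.FinalResult())["text"]
```

With a grammar like this, a word such as “learning” is simply not in the search space, so the recognizer has to pick one of the listed phrases or fall back to “[unk]”.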