An issue I have is trying to get the example from the release video to work.
I had posted a comment about it here, but it isn't getting any traction, so I thought I'd spam it here too… The compiler errors, wanting esp-adf instead of esp-idf, despite esp-adf not being a valid framework option (and the build also fails if I try it).
My code is straight from the example given on git… no idea what else to do.
EDIT: found out I had a couple of lines of code missing. All good now.
I think it's the lack of follow-up ability. When you start using an LLM (GPT-3/4) and it asks for more info or a follow-up, you're out of luck… (I did hear someone say to just say the wake word again and answer, but I'm not sure that works).
We really need local processing for what HA can do, with a fallback option so that if HA doesn't know, another assistant pipeline takes over… Meaning, I want to control everything in HA locally, but also use it for conversational things.
If you don't mind custom components, there are solutions for this.
I made one for example:
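The gist of it is a conversation agent that runs HA's built-in agent first and only hands the text to an LLM agent when the local attempt comes back with an error. Here is a minimal sketch of the pattern, not the component's actual code; `llm_agent_id` stands in for whichever OpenAI agent you've configured:

```python
from homeassistant.components import conversation
from homeassistant.core import HomeAssistant
from homeassistant.helpers import intent


class FallbackAgent(conversation.AbstractConversationAgent):
    """Try the local HA agent first; defer to an LLM agent on failure."""

    def __init__(self, hass: HomeAssistant, llm_agent_id: str) -> None:
        self.hass = hass
        self.llm_agent_id = llm_agent_id  # id of a configured LLM agent

    @property
    def supported_languages(self) -> list[str]:
        return ["en"]

    async def async_process(
        self, user_input: conversation.ConversationInput
    ) -> conversation.ConversationResult:
        # First pass: fast, fully local intent handling.
        result = await conversation.async_converse(
            self.hass,
            user_input.text,
            user_input.conversation_id,
            user_input.context,
            language=user_input.language,
            agent_id=conversation.HOME_ASSISTANT_AGENT,
        )
        # Local agent couldn't handle it: hand the same text to the LLM.
        if result.response.response_type == intent.IntentResponseType.ERROR:
            result = await conversation.async_converse(
                self.hass,
                user_input.text,
                user_input.conversation_id,
                user_input.context,
                language=user_input.language,
                agent_id=self.llm_agent_id,
            )
        return result
```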
As for the point of this thread, my second most used Alexa feature isn't available out of the box, and that's a problem for me: setting timers and reminders by voice.
I have implemented timers manually (a rough sketch of the approach is below), but it was a hassle to set up, and having it built in would be nice.
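The shape of my manual approach was a custom intent handler that just schedules a callback. The "SetTimer" name, the "minutes" slot, and the notification below are my own placeholders, and you still need a custom_sentences YAML file to map spoken phrases onto the intent, which is most of the hassle:

```python
import voluptuous as vol
from homeassistant.components import persistent_notification
from homeassistant.core import HomeAssistant, callback
from homeassistant.helpers import intent
from homeassistant.helpers.event import async_call_later


class SetTimerIntent(intent.IntentHandler):
    """Handle 'set a timer for N minutes' style requests."""

    intent_type = "SetTimer"
    slot_schema = {vol.Required("minutes"): vol.Coerce(int)}

    async def async_handle(self, intent_obj: intent.Intent) -> intent.IntentResponse:
        slots = self.async_validate_slots(intent_obj.slots)
        minutes = slots["minutes"]["value"]
        hass = intent_obj.hass

        @callback
        def _finished(_now) -> None:
            # Swap in tts.speak / notify for something audible.
            persistent_notification.async_create(
                hass, f"Your {minutes} minute timer is done", title="Timer"
            )

        # Schedule the one-shot callback; no persistence across restarts.
        async_call_later(hass, minutes * 60, _finished)

        response = intent_obj.create_response()
        response.async_set_speech(f"Timer set for {minutes} minutes")
        return response


def register(hass: HomeAssistant) -> None:
    """Call this from a custom component's setup."""
    intent.async_register(hass, SetTimerIntent())
```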
Using the voice assistant on Wear OS is very hit or miss. It rarely gets anything wrong, but sometimes picks up background noise like a TV. Also, it does very poorly at recognizing when the phrase has ended and continues to listen far longer than needed.
It's still a very impressive feat, and I'm looking forward to it being more polished in the future!
My voice commands are often not understood; it works much better with Alexa. I hope this will improve in the future, as I want to get Alexa out of my house.
I would like to weigh in on the positive side of the S3-BOX-3. I am using bubba's firmware with Marissa's "fallback conversation" (HA → OpenAI GPT-3.5 Turbo), and it really has helped. The fallback helps specifically because OpenAI struggles with the basic tasks that HA handles locally. The firmware also opens up the speaker volume beyond the stock level, which was very quiet.
I decided to completely ditch the concept of voice control because it's simply unviable. I have tried different hardware (computer microphone, headset, ESP32 I2S), different ASR backends (Whisper, Vosk, Rhasspy), and different languages (English, Chinese), and below is the constant result. I have even tried an official Amazon Echo, and it has trouble distinguishing "On" and "Off". The conclusion for me is that voice control is worth nothing, and it's just MUCH MUCH quicker to take out the phone, open the companion app, and click a button.
@marisa - I'm using BigBobba's modified ESPHome code:
It massively improves on the basic S3 Box 3 code, supports a timer, and exposes a media player.
Like @dza, though, I find that the Box 3 often locks up if it doesn't quite understand what you are saying.
On-device wake word detection is pretty good, just not quite there yet to be usable by the rest of the family. You have to learn how to speak to it for it to be more reliable.
I found that pushing the beam size to 5 made a huge difference, but I'm just guessing at the settings.
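For anyone else poking at this blind like I was: beam size is a decoding parameter, not an audio one. A quick way to A/B it outside HA is the faster-whisper Python package (the engine behind the Wyoming Whisper add-on); the "small" model and command.wav here are just placeholders:

```python
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")

for beam_size in (1, 5):
    # beam_size=1 is greedy decoding (fast, brittle); 5 keeps five
    # candidate transcripts alive per step and picks the best-scoring one,
    # which is why it helps with near-miss words.
    segments, _info = model.transcribe(
        "command.wav", beam_size=beam_size, language="en"
    )
    print(beam_size, " ".join(segment.text for segment in segments))
```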
My latest issue is that my ESP32-Box has stopped responding to the wake word. I re-flashed it with the stock code from the ESPHome projects page to see if it would help, but nothing.
In the ESPHome settings page, the wake word location selector is greyed out as "unavailable", which seems like the only hint as to what is going on.
I've been trying out a basic ESP32-S3 with an INMP441 mic setup, and the biggest frustration I have right now is the STT errors. I'm curious if there's anything out there like a simplified model that only recognizes a subset of words associated with an automated home? For example, I ask Jarvis to turn on the living room lamp, and it thinks I said "Turn on the learning room light". The word "learning" will NEVER be used when I am voice-controlling things around the house, so I'd love to not even have it as an option for the Whisper model to choose from.
Of course, with a general LLM you need the full breadth of vocabulary if you're going to ask it random questions, but that could be its own pipeline.
*edit: Vosk STT seems like a pretty good option for limiting what can be recognized.
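What sold me on it: Vosk's KaldiRecognizer accepts an optional grammar, a JSON list of the phrases it is allowed to output (plus the special "[unk]" token), so a word like "learning" simply can't appear. A minimal sketch; the model path, phrases, and WAV file are placeholders, and the audio needs to be 16 kHz mono PCM:

```python
import json
import wave

from vosk import KaldiRecognizer, Model

model = Model("vosk-model-small-en-us-0.15")  # path to an unpacked Vosk model
phrases = [
    "turn on the living room lamp",
    "turn off the living room lamp",
    "[unk]",  # catch-all so out-of-grammar speech isn't forced into a command
]
# Passing the grammar restricts the recognizer's entire search space.
rec = KaldiRecognizer(model, 16000, json.dumps(phrases))

with wave.open("command.wav", "rb") as wf:
    while True:
        data = wf.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])
```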