Voice Chapter 11: multilingual assistants are here

Saw the live-stream, some ideas:

About confirmations:

  • I think it should be configurable
  • All my devices are in one area; I would want different types of confirmations for different devices

Configurable fuzzy match options:

  • Strict match
  • Very Strict match
  • Fuzzy match
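The three strictness levels above could be sketched as thresholds on a string-similarity score. This is a minimal illustration using Python's standard-library difflib; the `MatchMode` names and the specific threshold values are my assumptions, not anything from Home Assistant's actual matching code.

```python
# Sketch of configurable match strictness for voice phrases.
# Threshold values are illustrative assumptions.
from difflib import SequenceMatcher
from enum import Enum

class MatchMode(Enum):
    VERY_STRICT = 1.0   # exact match only
    STRICT = 0.9        # tolerate tiny typos / plural endings
    FUZZY = 0.7         # tolerate larger differences

def matches(spoken: str, target: str, mode: MatchMode) -> bool:
    """Return True if the spoken phrase is close enough to the target."""
    ratio = SequenceMatcher(None, spoken.lower(), target.lower()).ratio()
    return ratio >= mode.value

print(matches("living room light", "living room lights", MatchMode.FUZZY))        # True
print(matches("living room light", "living room lights", MatchMode.VERY_STRICT))  # False
```

A per-device setting could then just store a `MatchMode` per satellite, which would also cover the per-device confirmation idea above.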

Would also love to hear more about the voice hardware roadmap, with future reference-hardware plans for both ESPHome on ESP32 and Linux Voice Assistant on ARM64 satellites from the Open Home Foundation and partners. Will you, for example, be making an official Linux Voice Assistant product that matches the Home Assistant Voice Preview Edition as reference hardware?

Would it be a good idea to have voice satellites with a fixed microphone board and a modular "compute" (a.k.a. "core") board, like the FutureProofHomes Satellite1 modular design concept?

That is, make two swappable SoM (System-on-Module) "compute" boards, with one SoM board based on ESP32 and one SoM board based on a powerful ARM64 SoC (similar to the Raspberry Pi Zero 2 W)?

Maybe base the ARM64 SoM compute board on a SoC with a built-in NPU that is powerful enough to off-load Speech-to-Text, freeing up resources for other tasks.

Perhaps even make the ESP32 SoM compute board with multiple ESP32 chips on the same board, to off-load the communication tasks and also let it work as a Thread Border Router (if they could combine an ESP32-S3 with an ESP32-C6).

Also wondering if you could test whether inexpensive AI accelerator hardware can run a small LLM?

I'll stop you here… This does not exist.

The smallest GPU (yes, you need one in LLM land, it's not optional… VRAM is your limiting factor) you can hope to use and still have a decent experience with is something better than an Nvidia 3xxx with at LEAST 8 GB of VRAM. Preferably 16 or more.

That puts you at a MINIMUM of $800 USD. Probably in the low $1000s.
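The "VRAM is your limiting factor" point is easy to see with back-of-the-envelope arithmetic: model weights alone take roughly (parameter count × bytes per parameter), plus overhead for the KV cache and activations. The 20% overhead factor below is a rough assumption for illustration, not a measured figure.

```python
# Back-of-the-envelope VRAM estimate for running an LLM locally.
# The 20% overhead factor (KV cache, activations) is a rough assumption.
def vram_needed_gb(params_billion: float, bytes_per_param: float,
                   overhead: float = 0.2) -> float:
    weights_gb = params_billion * bytes_per_param  # 1B params at 1 byte ≈ 1 GB
    return weights_gb * (1 + overhead)

# A 7B model quantized to 4 bits (0.5 bytes/param) fits in 8 GB of VRAM;
# the same model at fp16 (2 bytes/param) does not.
print(round(vram_needed_gb(7, 0.5), 1))  # 4.2
print(round(vram_needed_gb(7, 2.0), 1))  # 16.8
```

That is why 8 GB is a practical floor even for small quantized models, and why 16 GB buys you noticeably more headroom.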


As of 2025.11, timer_command: conversation_command still only works for the standard "Home Assistant" agent.

Does anyone know if they will add a variety or more new wake words in the next version?

I would not expect them to. They have three (four if you include the stop word) perfectly good, copyright-cleared terms already (even if we don't like them) and have documented how to make your own.

If they build a new one and run afoul of copyright or service marks or trademark or… They get sued because they’re a business entity.

If you do it… it's your install… have fun. Maybe the copyright owner comes to tell you to stop, but HA isn't sued out of existence just for shipping something.

Given those choices, if I'm your dev PM I won't LET you make more; three is fine, we have bigger fish to fry. You have satisfied the build requirements and prevented scope creep and a lawsuit. I call it a win. Now, am I in that room, and do I KNOW they made that decision? No. But coming from the perspective of someone who's had to make decisions like that, I wouldn't hold my breath… (probably not)


I am fine with just a single "ok nabu", it just should work as well as "ok google" 🙂


Finally, I can see a route to moving away from Alexa/Plex. Lighting and music are my primary use cases (with heating, at least logging, being the next most relevant). LMS has replaced Plex now that I have some stand-alone wireless speakers (and I can use about 3 more squeezelite players).
I’m not too worried about the performance of local acceleration, I expect models and hardware to converge soon enough as the next ASIC iterations start coming to market.
Immediate priorities for me are phrase/context accuracy, and music library handling - which both seem to be in hand.