Voice assisted speaker/monitor HW

Hi Guys,
I have never been a big fan of voice-assisted speakers/devices like Alexa or Nest (although I have Nest enabled via Google Home and integrated with HASS).

Given that the year of voice is almost over and great progress has been made, I am looking around to see what HW I can buy to try it.

I looked at the ESP32-BOX-S3, which seems very interesting.

It’s a bit expensive for an experiment, so I was thinking of something similar but cheaper, maybe.

I see that the discontinued ESP32-BOX is still around and, even if “lightweight”, it’s cheaper.

I am looking for a mic + speaker system, possibly with a monitor as well.
It does not need to be a touch screen, or even have a monitor at all, but it would be cooler.
My main objective so far is to experiment without spending much, with the idea that the device will keep working with HASS for as long as possible.

Yes, at the moment I don’t have the time to solder/DIY/play with electronics. My breadboard has been sitting unused in my drawer for years, along with several sensors and other cool stuff.
I need a pre-built system. I can spend a bit of time on a one-off hack to make it work (firmware update, etc.), but not build or modify something, and even less learn how to do it (unfortunately).

Given that, I have a few questions:

  • In terms of HW, what are the alternatives to the S3-BOX or S3-BOX-3?
  • For the S3-BOX, is this still the best repo? It seems to have been untouched for a while.
  • Can I reuse a Nest by modifying/flashing it so that it no longer uses Google Assistant, but talks directly to HASS and acts de facto as a speaker/mic voice assistant with my Assist pipeline?

Edit: looking at wyoming-satellite, might a very old RPi (first edition or so) or an RPi Zero W be a solution for now?

Thanks!

I just bought a used Jabra Speak 410. It seems to work very well as a microphone and speaker. My HA server sits on the bookshelf in my living room and the Jabra is connected directly to it. The microphone easily covers the whole 4 x 5 meter room. Unfortunately, the latency is too long for it to be usable. I suspect the Nabu Casa cloud speech-to-text, but it could be a delay anywhere in the pipeline and I cannot identify which.

I still need to figure out properly how things work and connect to each other, but I think the Nabu Casa issue can be verified by using a local pipeline.
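To narrow down where the wait comes from, one option is to time each stage separately and compare the cloud and local runs. Below is a minimal, generic Python sketch of that idea. It does not call any Home Assistant or Nabu Casa API: the helpers `time_stages` and `slowest_stage` and the fake stage functions are hypothetical placeholders that you would swap for your own calls.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# A stage is a (name, function) pair; each function takes the previous output.
Stage = Tuple[str, Callable[[Any], Any]]


def time_stages(stages: List[Stage], payload: Any) -> Tuple[Any, Dict[str, float]]:
    """Run the stages in order and record the seconds spent in each one."""
    timings: Dict[str, float] = {}
    for name, run in stages:
        start = time.perf_counter()
        payload = run(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings


def slowest_stage(timings: Dict[str, float]) -> str:
    """Return the name of the stage that took the longest."""
    return max(timings, key=timings.get)


if __name__ == "__main__":
    # Placeholder stages (assumptions): replace with real STT / intent / TTS calls.
    def fake_stt(audio: bytes) -> str:
        time.sleep(0.2)  # pretend the speech-to-text round trip is slow
        return "turn on the light"

    def fake_intent(text: str) -> str:
        return "ok: " + text

    def fake_tts(text: str) -> bytes:
        return text.encode("utf-8")

    stages = [("stt", fake_stt), ("intent", fake_intent), ("tts", fake_tts)]
    _, timings = time_stages(stages, b"raw audio")
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.3f}s")
    print("slowest:", slowest_stage(timings))
```

Running the same measurement once against the cloud STT and once against a local one should show whether the speech-to-text step is really the culprit.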

Update: I ended up trying an RPi Zero 2 W with a ReSpeaker 2-Mics Pi HAT, following the tutorial for wyoming-satellite.

It works, except that the STT model is far from perfect (it does not understand any of the languages that I speak, even the native ones), but it is good enough for non-production use.