A starting point is the GitHub repo for this software, sle118/squeezelite-esp32 (ESP32 music streaming based on Squeezelite, with support for multi-room sync, AirPlay, Bluetooth, hardware buttons, display and more). It’s not quite a simple step-by-step process, though once you’ve wired up the pins on an ESP32, you’re over one major hurdle. Where it can get tricky is if you want to connect switches and have squeezelite-esp32 act on those for play/pause, skip forward/back, etc. You have to compose a JSON blob and upload it to the running squeezelite-esp32 to define those pins and functions. But you can do that after you get the basic music-playing capability working.
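To give a feel for that JSON blob, here is a rough Python sketch that builds one for three switches. The GPIO numbers are placeholders, and the key names and ACTRLS_* action names are what I remember from the repo’s README, so treat them as assumptions and check them against the firmware version you flash. The button() helper is just something made up for this example.

```python
import json

# Assumption: key names ("gpio", "type", "pull", "normal", "longpress", ...) and
# the ACTRLS_* action names follow the squeezelite-esp32 README as I recall it.
# GPIO numbers below are placeholders; use pins that are free on your board.
def button(gpio, action, long_action=None, long_press_ms=1000):
    """Hypothetical helper: describe one switch wired between a GPIO and ground."""
    entry = {
        "gpio": gpio,
        "type": "BUTTON_LOW",           # pin reads low while the switch is pressed
        "pull": True,                   # use the internal pull resistor
        "normal": {"pressed": action},  # action fired on a normal press
    }
    if long_action is not None:
        entry["long_press"] = long_press_ms           # hold time in milliseconds
        entry["longpress"] = {"pressed": long_action}
    return entry

buttons = [
    button(4, "ACTRLS_TOGGLE"),   # play/pause
    button(5, "ACTRLS_NEXT"),     # skip forward
    button(18, "ACTRLS_PREV"),    # skip back
]

# Compact one-line JSON, ready to paste into the device configuration.
blob = json.dumps(buttons, separators=(",", ":"))
print(blob)
```

How the blob actually gets stored on the device is covered in the README; the script only saves you from hand-typing brackets and quotes.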
You should decide at the outset if you plan to have Music Assistant (MA) installed and part of your world. There are many good reasons to do this, and going down this path means that MA automatically pauses any music that’s playing, plays the TTS announcement, and then resumes. And you can configure MA to include a little “ding” noise before it plays the announcement. If you go down this path, then you can expose the MA devices to Home Assistant as media players and use those as the interface.
Alternatively, you can add the squeezelite-esp32 devices to Home Assistant directly as “Squeezebox” media players, but then you don’t get those other nice features.
Note that if you go down this path, then the device is not going to be a Voice Assistant as it has no microphone, etc. Just a music/media player.
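Once the player shows up in Home Assistant, a TTS announcement is just a service call. Below is a minimal Python sketch that builds a request for Home Assistant’s REST API calling tts.speak. The host name, the token, and the entity IDs (tts.piper, media_player.kitchen_speaker) are all made-up placeholders, so swap in your own. Whether the music pauses and resumes around the announcement depends on the player being one exposed through MA, as described above.

```python
import json
import urllib.request

def build_tts_request(base_url, token, tts_entity, player_entity, message):
    """Build (but do not send) a POST to Home Assistant's tts.speak service."""
    payload = {
        "entity_id": tts_entity,                  # the TTS engine entity
        "media_player_entity_id": player_entity,  # where the speech should play
        "message": message,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/services/tts/speak",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",   # long-lived access token
            "Content-Type": "application/json",
        },
        method="POST",
    )

# All values here are placeholders for illustration.
req = build_tts_request(
    "http://homeassistant.local:8123",
    "YOUR_LONG_LIVED_TOKEN",
    "tts.piper",
    "media_player.kitchen_speaker",
    "Dinner is ready",
)
print(req.full_url)
# urllib.request.urlopen(req)  # uncomment to actually send the announcement
```

The request is only built, not sent, so you can run the script safely and inspect it before pointing it at a real instance.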
If you start with that GitHub link and then do some searches, chances are you’ll turn up something more like a tutorial. The documentation on the GitHub page is pretty complete, though it can be a little confusing at the outset until you get the feel for how it wants to be installed, configured and used. Like most things…
EDIT: To be fair, it’s been more than a few months since I went down this path, so possibly the music playing experience with ESPHome (as compared to the squeezelite-esp32 approach) has improved. The voice assistant and media playing code has been under active development, so what I experienced 6 months ago is not the same thing you’ll see today.
On the other hand, if you have shitty speaker acoustics, it really doesn’t matter what software is driving it. So what you invest in the speaker and enclosure will pay dividends regardless of what software stack you choose. Heck, try them both once you get the hardware working and see how each feels!