Your wishlist is too ambiguous. Maybe in 3 or 4 years it can be done, but not yet. Get your wishlist sharp and clean first. My advice would be to first investigate each bullet on your list and see what is possible and what is not. Put the results for each bullet next to each other and you will see your gaps in knowledge and the tech gap at the moment. Then fill those knowledge black holes.
At the moment there is a lot of development in synced audio streaming and voice. What is hot right now is old hardware in 6 months' time. How likely is it that replacement parts will still be available, say, 5 years from now?
You mentioned that you don’t mind doing some hardware and software work. For whole-house audio there are a lot of options, both bought and DIY. I have 10 rooms wired here and have built up some experience with it over the last 15 years. I am not saying it is the best option (for me it was), but I just want to give you some ideas to look at, just to widen your knowledge. (Shameless plug following.)
Look into an XAP800 device. Cheap, easy to get and not too difficult to use.
Want to make something with an ESP32? Then have a look at this ZMC 5.0
VoicePuck - One of the many voice assistant variants
And please also look into other options, because they may be better suited.