Hello everyone,
I’m currently planning a KNX/Home Assistant-based smart home system for my new home, and I’ll definitely have many more questions.
Right now, I’m dealing with what is, from my perspective, probably the most complex topic: audio control and multi-room audio.
Here’s what I mean and what I want to achieve:
- No cloud connection or other forms of obsolescence
I work in software development (garbage-collected languages) and also have some hardware experience (e.g., flashing firmware). So I’m completely fine with a tinkerer’s solution. I would actually prefer it, because then I know how everything works and can replace components if needed, regardless of whether the manufacturer still exists.
- Voice control and voice output
I want to equip each room with a microphone and music-ready speakers. For voice control, I don’t need anything “smart” like Alexa - fixed phrases are completely sufficient.
- “Carrying” a conference with me
I want to connect my laptop (or smartphone), for example in the office, via the mic/headphone combo jack or via Bluetooth, and use the microphone and speakers installed in the room. During a Teams meeting, I want to be able to move to the living room and switch the audio over to that room via voice command. I should still be able to issue voice commands during the call (ideally, commands would only be recognized from the local microphone, not from the Teams participants). Echo/feedback loops should also be avoided.
With a Bluetooth connection, I’d also like to control the media player (“next track”, volume, etc., e.g., via AVRCP).
- Multi-room
Similar to the conference case above, I want to play music in multiple rooms or “carry it along” from room to room. Synchronization between rooms would be ideal, but is not strictly necessary.
- Context
The system should “know” who is talking to it and where. Since each of us has different preferences, shorter phrases would make sense. For example, each person could have their own wake word, and depending on which microphone the command comes from, different routines would be triggered.
If I say “lights on” in the bathroom, the light should come on at 30%, whereas my daughter might need 80%. Since presence sensors are planned anyway, automatic “tracking” of lights, music, etc. would also be interesting, though not required.
- Power efficiency
Very low idle power consumption would be highly desirable.
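To make the “Context” point above more concrete, here is a minimal Python sketch of how a central server might resolve a fixed phrase into a per-person, per-room action. Everything in it is an assumption for illustration: the wake word names, room ids, and the `resolve_brightness` helper are hypothetical and not part of Home Assistant, ESPHome, or any other library. Only the 30% / 80% values come from the example above.

```python
from dataclasses import dataclass

# Hypothetical wake words, one per person (illustrative names only).
WAKE_WORD_TO_PERSON = {
    "hey_anton": "anton",
    "hey_daughter": "daughter",
}

# (person, room, phrase) -> brightness in percent.
LIGHT_PRESETS = {
    ("anton", "bathroom", "lights on"): 30,
    ("daughter", "bathroom", "lights on"): 80,
}

# Fallback when no personal preset exists (assumed value).
DEFAULT_BRIGHTNESS = 100


@dataclass(frozen=True)
class Command:
    wake_word: str  # which wake word fired
    room: str       # which satellite microphone heard it
    phrase: str     # fixed phrase recognized after the wake word


def resolve_brightness(cmd: Command) -> int:
    """Look up the brightness for this speaker, room, and phrase."""
    person = WAKE_WORD_TO_PERSON.get(cmd.wake_word)
    key = (person, cmd.room, cmd.phrase.strip().lower())
    return LIGHT_PRESETS.get(key, DEFAULT_BRIGHTNESS)


if __name__ == "__main__":
    print(resolve_brightness(Command("hey_anton", "bathroom", "lights on")))
    print(resolve_brightness(Command("hey_daughter", "bathroom", "lights on")))
```

The point of the sketch is only that the speaker comes from the wake word and the location from the satellite that heard it, so the phrase itself can stay short.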
That’s my wishlist for now, and I’m aware that this is technically ambitious. I assume there is no ready-made solution for this at the moment - I certainly haven’t found anything comparable so far.
So far, I’ve mostly looked into the Raspberry Pi. But given its power consumption and the fact that the ESP32 apparently has everything I need, I could imagine doing this with ESP32 boards and ESPHome. I’m not familiar with either yet, though.
Here’s what I’m currently considering:
- Installing 1-2 microphones with ESP32 in each room.
What microphones would you recommend? At the moment I’m mainly considering two options:
a. reSpeaker XMOS XVF3800 with XIAO ESP32S3
The range is specified as 5 meters. If I place it on the ceiling, the radius is of course smaller. Alternatively, wall mounting - probably better for conferencing, but worse for controlling things from bed.
I find the integrated ESP32-S3 particularly interesting because it means I need fewer additional components.
I don’t yet know whether the quality is sufficient for conferencing.
b. Something like the Anker PowerConf S500
This would be the premium option. It also has a 5-meter range, but you can pair two devices if needed. I don’t know how much power it consumes or whether it can be connected via USB to an ESP32. I could consider routing the assistant’s voice output through it.
The ESP32 would probably transmit audio over Wi-Fi to a central server.
- Equipping each room with 2 speakers
I would connect a 2x ~50 W class-D amplifier to the ESP32 via I2S. The integrated ESP32 DAC seems unsuitable for this.
- Connecting laptop or smartphone via 3.5 mm jack or Bluetooth
Ideally, I wouldn’t need a dedicated ESP32 just for this, but could use one of the units mentioned above.
- Stream and media control handled by the central server
The question, of course, is how to route audio between the ESP32s and the server. I see there are WebRTC projects. Would UDP or TCP be possible and maybe simpler? And what about echo/feedback? And I’d need to be able to send media commands to the laptop/smartphone.
I’m also unsure whether the quality after multiple transcoding steps would be sufficient for speech and music.
If all of this works in principle, integration with Home Assistant and KNX should be relatively straightforward.
- Smart functions
There are many open-source libraries for wake words, speech recognition, and text-to-speech. I can’t yet judge which of them are fast and good enough. In the setup described, a central server would have somewhat more processing power.
Overall, I would need 7 to 10 satellites.
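Regarding the UDP question above, here is a minimal Python sketch of what a plain UDP framing for microphone audio could look like on the server side, plus a rough bandwidth estimate. The parameters (16 kHz, 16-bit mono, 20 ms frames) and the 9-byte header layout are assumptions chosen for illustration, not an existing protocol or anything ESPHome defines. On the wire, each packet would simply be sent with `socket.sendto` and read with `socket.recvfrom`.

```python
import struct

# Assumed capture format (illustrative, typical for speech).
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2   # 16-bit signed PCM
CHANNELS = 1
FRAME_MS = 20

# Hypothetical header, network byte order:
# satellite id (uint8), sequence number (uint32), capture time in ms (uint32).
HEADER = struct.Struct("!BII")


def frame_bytes() -> int:
    """Payload size of one audio frame in bytes."""
    samples = SAMPLE_RATE_HZ * FRAME_MS // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def pack_frame(satellite_id: int, seq: int, timestamp_ms: int, pcm: bytes) -> bytes:
    """Prefix one PCM frame with the header so the server can reorder frames."""
    if len(pcm) != frame_bytes():
        raise ValueError(f"expected {frame_bytes()} bytes, got {len(pcm)}")
    # Mask the counters so they wrap around instead of overflowing uint32.
    header = HEADER.pack(satellite_id, seq & 0xFFFFFFFF, timestamp_ms & 0xFFFFFFFF)
    return header + pcm


def unpack_frame(packet: bytes) -> tuple[int, int, int, bytes]:
    """Split a received packet back into header fields and PCM payload."""
    satellite_id, seq, timestamp_ms = HEADER.unpack_from(packet)
    return satellite_id, seq, timestamp_ms, packet[HEADER.size:]


def payload_kbit_per_s(satellites: int = 1) -> float:
    """Uncompressed audio payload rate, excluding header and IP/UDP overhead."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * 8 * satellites / 1000


if __name__ == "__main__":
    silence = bytes(frame_bytes())
    packet = pack_frame(satellite_id=3, seq=1, timestamp_ms=20, pcm=silence)
    print(len(packet), unpack_frame(packet)[:3], payload_kbit_per_s(10))
```

Under these assumptions, one satellite produces 256 kbit/s of uncompressed payload, so 10 satellites streaming at once would be about 2.56 Mbit/s before header overhead. The sequence number lets the server detect lost or reordered packets, which UDP does not handle by itself.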
What do you think about this? Are there alternatives? Could it be simpler? Which concrete components should I use? Are there other forums I should ask in? I’m open to anything.
Thanks in advance!
Anton
