Audio control, conferencing and multi-room audio

Hello everyone,

I’m currently planning my KNX/HomeAssistant-based smart home system for my new home, and I’ll definitely have many more questions :blush:

Right now, I’m dealing with what is, from my perspective, probably the most complex topic: audio control and multi-room audio.

Here’s what I mean and what I want to achieve:

  1. No cloud connection or other forms of obsolescence

I work in software development (garbage-collected languages) and also have some experience with hardware (e.g., flashing). So I’m completely fine with a tinkerer’s solution - I would actually prefer it, because then I know how everything works and can replace components if needed, regardless of whether the manufacturer still exists.

  2. Voice control and voice output

I want to equip each room with a microphone and music-ready speakers. For voice control, I don’t need anything “smart” like Alexa - fixed phrases are completely sufficient.

  3. “Carrying” a conference with me

I want to connect my laptop (or smartphone) in the office, for example, via the mic/headphone combo jack or via Bluetooth, and use the installed microphone and speakers in the room. During a Teams meeting, I want to be able to move to the living room and switch the transmission there via voice command. I should still be able to issue voice commands (ideally commands would be recognized via the local microphone - not from the Teams participants :smile:). Echo/feedback loops should also be avoided.

With a Bluetooth connection, I’d also like to control the media player (“next track”, volume, etc., e.g., via AVRCP).

  4. Multi-room

Similar to point 3, I want to play or “carry along” music across multiple rooms. Synchronization would be ideal, but not strictly necessary.

  5. Context

The system should “know” who is talking to it and where. Since each of us has different preferences, shorter phrases would make sense. For example, each person could have their own wake word, and depending on which microphone the command comes from, different routines would be triggered.
If I say “lights on” in the bathroom, it should turn on to 30%, whereas my daughter might need 80%. Since presence sensors are planned anyway, automatic “tracking” of lights, music, etc. would also be interesting, though not required.
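To make the idea concrete, here is a minimal sketch of how the wake word (who) and the microphone location (where) could be combined into one lookup. The wake-word names, the preference table, and the 100% fallback are assumptions for illustration only; the returned dictionary merely mirrors the shape of a Home Assistant `light.turn_on` call with `brightness_pct` and is not wired to anything.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping: each person has their own wake word.
WAKE_WORD_OWNER: Dict[str, str] = {
    "wake_word_anton": "anton",
    "wake_word_daughter": "daughter",
}

# Hypothetical preferences: (person, room) -> brightness in percent.
BRIGHTNESS_PREFS: Dict[Tuple[str, str], int] = {
    ("anton", "bathroom"): 30,
    ("daughter", "bathroom"): 80,
}

DEFAULT_BRIGHTNESS = 100  # assumed fallback when no preference is stored


def handle_command(wake_word: str, room: str, phrase: str) -> Optional[dict]:
    """Turn a fixed phrase into an action, using who spoke and where."""
    person = WAKE_WORD_OWNER.get(wake_word)
    if person is None:
        return None  # unknown wake word: ignore the command
    if phrase == "lights on":
        brightness = BRIGHTNESS_PREFS.get((person, room), DEFAULT_BRIGHTNESS)
        return {"action": "light.turn_on", "room": room, "brightness_pct": brightness}
    return None  # phrase not in the fixed vocabulary
```

With this table, “lights on” from the bathroom microphone resolves to 30% for my wake word and 80% for my daughter’s, while a room without a stored preference falls back to the default.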

  6. Power efficiency

A very low idle power consumption would be highly desirable.

That’s my wishlist for now, and I’m aware that this is technically ambitious. I assume there is no ready-made solution for this at the moment - I certainly haven’t found anything comparable so far.

So far, I’ve mostly looked into Raspberry Pi. But given the power consumption and the fact that the ESP32 apparently has everything I need, I could imagine doing this with ESP32 and ESPHome. I’m not familiar with it yet, though.

Here’s what I’m currently considering:

  1. Installing 1-2 microphones with ESP32 in each room.

What microphones would you recommend? At the moment I’m mainly considering two options:

a. reSpeaker XMOS XVF3800 with XIAO ESP32S3

The range is specified as 5 meters. If I place it on the ceiling, the radius is of course smaller. Alternatively, wall mounting - probably better for conferencing, but worse for controlling things from bed.

I find the integrated ESP32-S3 particularly interesting because it means I need fewer additional components.

I don’t yet know whether the quality is sufficient for conferencing.

b. Something like the Anker PowerConf S500

This would be the premium option. It also has a 5-meter range, but you can pair two devices if needed. I don’t know how much power it consumes or whether it can be connected via USB to an ESP32. I could consider routing the assistant’s voice output through it.

The ESP32 would probably transmit audio over Wi-Fi to a central server.

  2. Equipping each room with 2 speakers

I would connect a 2x ~50 W class-D amplifier to the ESP32 I2S. The integrated ESP32 DAC seems unsuitable for this.

  3. Connecting laptop or smartphone via 3.5 mm jack or Bluetooth

Ideally, I wouldn’t need a dedicated ESP32 just for this, but could use one of the units mentioned above.

  4. Stream and media control handled by the central server

The question, of course, is how to route audio between the ESP32s and the server. I see there are WebRTC projects. Would UDP or TCP be possible and maybe simpler? And what about echo/feedback? And I’d need to be able to send media commands to the laptop/smartphone.
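To illustrate what a plain UDP transport might involve, here is a rough sketch of a datagram format for raw PCM frames. The header layout (sequence number, index of the first sample, payload length) is an assumption made up for this example, not an existing protocol; it only shows that the receiver needs enough information to detect lost or reordered packets, which TCP would otherwise hide at the cost of extra latency.

```python
import struct
from typing import Tuple

# Assumed header: sequence number, index of first sample, payload length.
# "!" selects network byte order without padding, so the header is 10 bytes.
HEADER = struct.Struct("!IIH")


def pack_frame(seq: int, sample_index: int, pcm: bytes) -> bytes:
    """Prefix one PCM frame with the header so it fits in a single datagram."""
    return HEADER.pack(seq & 0xFFFFFFFF, sample_index & 0xFFFFFFFF, len(pcm)) + pcm


def unpack_frame(datagram: bytes) -> Tuple[int, int, bytes]:
    """Split a received datagram back into (seq, sample_index, pcm)."""
    seq, sample_index, length = HEADER.unpack_from(datagram)
    payload = datagram[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated datagram")
    return seq, sample_index, payload


def frames_lost(prev_seq: int, seq: int) -> int:
    """Number of frames missing between two received sequence numbers."""
    return (seq - prev_seq - 1) & 0xFFFFFFFF
```

A 20 ms frame of 16 kHz, 16-bit mono audio is 640 bytes, so each datagram would be 650 bytes and stay well below a typical 1500-byte Ethernet MTU. Echo cancellation is a separate problem that this framing does not address.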

I’m also unsure whether the quality after multiple transcoding steps would be sufficient for speech and music.

If all of this works in principle, integration with Home Assistant and KNX should be relatively straightforward.

  5. Smart functions

There are many open-source libraries for wake words, speech recognition, and text-to-speech. I can’t yet judge which of them are fast and good enough. In the setup described, a central server would have somewhat more processing power.

Overall, I would need 7 to 10 satellites.

What do you think about this? Are there alternatives? Could it be simpler? Which concrete components should I use? Are there other forums I should ask in? I’m open to anything.

Thanks in advance!

Anton

Your wishlist is too ambiguous. Maybe in 3 or 4 years it can be done, but not yet. Get your wishlist sharp and clean first. My advice would be to first investigate each bullet on your list and see what is possible and what is not. Put the results for each bullet next to each other and you will see both your gaps in knowledge and the current tech gap. Then fill those knowledge black holes.
At the moment there is a lot of development in synced audio streaming and voice. What is hot now is old hardware in 6 months’ time. How likely is it that replacement parts will still be available, say, 5 years from now?

You mentioned that you don’t mind doing some hardware and software work. For whole-house audio there are a lot of options, both bought and DIY. I have 10 rooms wired here and have built up some experience with it over the last 15 years. I’m not saying it is the best option (for me it was), but I just want to give you some ideas to look at, to widen your knowledge. (Shameless plug following.)

Look into an XAP800 device. Cheap, easy to get, and not too difficult to use.

If you want to make something with an ESP32, have a look at this: ZMC 5.0

VoicePuck - One of the many voice assistant variants

And please also look into other options, because they may be better suited.


I have NUCs scattered around the house that are tied together with MPD and Snapcast. My HA instance has the MPD daemon add-on installed, and I can control my media server through HA from almost anywhere in the house. I don’t use voice commands, but plenty of people here do and will likely be able to help there.


Possibly look into and/or experiment with the Voice Preview Edition from Nabu Casa for the satellites. They are ESP-based and have a speaker, a mic, and a headphone jack for output to external audio. They are open and can be modified via ESPHome and/or hardware hacking.

Also check out Music Assistant, which lets you send audio to connected media players and create groups of like or different players.


I was originally planning to poo-poo your idea, but I like it too much. I don’t completely understand its purpose and I see dragons here, but it’s a fun idea.

I’m thinking of this as two separate systems: a mic input system and an audio output system. At the end of the day, a voice device is mic input plus audio out. Nothing says you cannot send audio to another device or receive audio from another device. If you can select the mic source independently of the audio output, you can achieve your goal.
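As a rough illustration of treating inputs and outputs separately, here is a sketch of a routing table in which every microphone or line input is a source and every speaker or line output is a sink. The class, the method names, and the endpoint naming scheme (`office.mic`, `laptop.in`, and so on) are all hypothetical; the point is only that moving a conference to another room becomes a change of two table entries.

```python
from typing import Dict, Set


class AudioRouter:
    """Hypothetical routing table: each source feeds a set of sinks."""

    def __init__(self) -> None:
        self._routes: Dict[str, Set[str]] = {}

    def connect(self, source: str, sink: str) -> None:
        self._routes.setdefault(source, set()).add(sink)

    def disconnect(self, source: str, sink: str) -> None:
        self._routes.get(source, set()).discard(sink)

    def sinks_for(self, source: str) -> Set[str]:
        return set(self._routes.get(source, set()))

    def start_session(self, room: str, endpoint: str) -> None:
        """Duplex link: room mic -> endpoint input, endpoint output -> room speaker."""
        self.connect(f"{room}.mic", f"{endpoint}.in")
        self.connect(f"{endpoint}.out", f"{room}.speaker")

    def move_session(self, old_room: str, new_room: str, endpoint: str) -> None:
        """Re-point both directions of a running session to another room."""
        self.disconnect(f"{old_room}.mic", f"{endpoint}.in")
        self.disconnect(f"{endpoint}.out", f"{old_room}.speaker")
        self.start_session(new_room, endpoint)
```

For example, `router.start_session("office", "laptop")` followed by `router.move_session("office", "living_room", "laptop")` leaves the laptop connected only to the living-room microphone and speakers.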

Look into the Resonate protocol.
It is being added to ESPHome and may be useful for this.


Thank you for your pointers! What do you mean by ambiguous exactly? Basically, I want to route duplex audio from one satellite to another and have some sort of assistant - not necessarily AI-based. With these building blocks, I can do anything I need.

From my point of view, this comes down to a few challenges:

  1. Picking up quality audio without echo from a microphone, 3.5 mm jack, or Bluetooth, and streaming it to the server.
  2. Playing quality audio from a stream coming from the server.
  3. Having something in between on the server to route, combine, and process the streams.
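For the “combine” part of the third challenge, a minimal sketch of mixing could look like the following. It assumes all streams have already been decoded to 16-bit signed PCM in the machine’s native byte order with the same sample rate and channel count; the function name is made up, and a real mixer would also need resampling and gain control.

```python
import array
from typing import List


def mix_frames(frames: List[bytes]) -> bytes:
    """Sum equally long 16-bit PCM frames sample by sample, clipping at the limits."""
    if not frames:
        return b""
    # Use the shortest frame so a late or short stream cannot cause an index error.
    n_samples = min(len(frame) for frame in frames) // 2
    mixed = [0] * n_samples
    for frame in frames:
        samples = array.array("h")  # "h" = signed 16-bit, native byte order
        samples.frombytes(frame[: n_samples * 2])
        for i, value in enumerate(samples):
            mixed[i] += value
    # Clip instead of wrapping around, which would sound like loud crackling.
    clipped = array.array("h", (max(-32768, min(32767, v)) for v in mixed))
    return clipped.tobytes()
```

Hard clipping is the simplest possible choice here; scaling each input before summing would avoid distortion when several loud sources overlap.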

I’m not sure what hardware you are referring to. Of course, components like microphones, DAC/ADC, and amplifiers need to be of decent quality.

I have no experience with the ESP32, but honestly, I don’t care whether the system is ESP32-based or something else. What I care about is long-term cost and quality. That’s why I’m asking whether the ESP32 can help me achieve my wishlist. At the moment it seems like a perfect fit, but probably with a lot of effort.

The XAP800 needs 30 W, and I don’t think the audio necessarily needs to be mixed by an external device. That’s why my idea is to route everything to the server and handle it there.

The problem is that I need full two-way audio for conferencing, and the music must come directly from a laptop or smartphone rather than from a central device.

Thanks! I looked at HA Voice a few months ago, but mainly from the voice-assistant point of view, and the reviews weren’t great.

I see that it has a very reasonable DAC/ADC, though, and I think it could be worth exploring in more depth :ok_hand:

Thanks! If I understand correctly, it’s not a duplex protocol. There are already some protocols out there (I’m not sure which of them are supported by or can be implemented on the ESP32), so the only real USP would be the ESP32-based implementation.

Using WebSocket binary messages for audio is quite brave. I tested them for video some time ago, and the performance was really poor compared to WebRTC in terms of quality and latency. It might work for audio, but I think it will impose some limitations on bitrate.

As I mentioned, playback synchronization would be nice, but that requires buffering, and I haven’t seen any duplex solutions. Buffering in particular could be a deal-breaker for conferencing use cases. Otherwise, I could try running both the server and client on the same device to simulate a duplex connection. I’m not sure that would be supported, but using two devices would also be OK. I see your point :handshake:
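To put rough numbers on the bitrate and buffering concerns, here is a small back-of-the-envelope calculation. The formats and buffer sizes used in the examples are assumptions for illustration, not measurements of any particular protocol.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> float:
    """Bitrate of uncompressed PCM audio in kbit/s."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def frame_bytes(sample_rate_hz: int, bits_per_sample: int, channels: int, frame_ms: int) -> int:
    """Size in bytes of one frame of uncompressed PCM audio."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * (bits_per_sample // 8) * channels


def buffering_delay_ms(frame_ms: int, buffered_frames: int) -> int:
    """Delay added by holding back a number of frames before playback."""
    return frame_ms * buffered_frames
```

Uncompressed speech at 16 kHz, 16-bit mono comes to 256 kbit/s, and 48 kHz, 16-bit stereo music to 1536 kbit/s, so raw audio is far lighter than video. Buffering three 20 ms frames adds 60 ms in each direction, whereas a synchronization buffer of fifty frames would add a full second, which is why the buffer size matters much more for conferencing than for music.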

I will just add to the discussion.

I am aware that there is a long-running thread on this forum about making this work between two ESP32s. It is going slowly, but I believe they are getting somewhere. I am not following it anymore.
But the problem with duplex audio on one device is that the mic and speaker are very close together. Then it will mostly be a push-to-talk operation.
The ESPHome Resonate protocol was also mentioned. It is still in the development phase, but it looks promising.

I have tested many mics for my own system, and no high-quality mic will have a 3.5 mm jack. The mic needs to be as good as possible so that the audio you start with is as clean as possible. I ended up using the Samson CM11B, but that was quite a few years ago. There must be better options today.

But the quality of your audio needs to be on par with what you are using it for. Background music and an occasional announcement do not need top-of-the-line hardware. Listening to studio-level audio is another case and requires something better than an ESP32 or RPi. So you need to figure out what will suit you. Having said that, the ESP or RPi can deliver surprisingly good audio for its price.

This is a hard one if you look at the life cycle of these devices. New and improved hardware is arriving all the time. The ESP8266 was the builder’s dream, but now builders pick a multi-core ESP32 for the same task, simply because they get better hardware for about the same price. This is even more important for voice hardware.

A server will also consume about that amount of power.
I am just pointing this option out because not many people are aware of it. I had a 10-room whole-house audio system running on it for 15 years, including mics in 5 rooms. I was able to make and take (landline) phone calls and use it as an intercom between any or many rooms.
For your information:
The XAP 800 is a conference mixer like the one you are looking for. It does gating on mics, can route up to 96 inputs to any of up to 96 outputs, can do filtering and many other kinds of audio processing, and can do noise cancelling. That means you can have your mic next to the TV playing your favorite show and the mic still picks up only your voice while cancelling out the TV.
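For readers unfamiliar with mic gating, here is a very simplified sketch of the idea in software: a frame is passed through only when its level exceeds a threshold, otherwise it is replaced by silence. This is not how the XAP 800 implements it; the function name and threshold are assumptions, the input is assumed to be 16-bit signed PCM in native byte order with an even number of bytes, and real gates add attack, hold, and release times to avoid chopping words.

```python
import array
import math


def gate_frame(pcm: bytes, threshold_rms: float) -> bytes:
    """Pass the frame if its RMS level reaches the threshold, else return silence."""
    samples = array.array("h")  # signed 16-bit samples, native byte order
    samples.frombytes(pcm)
    if not samples:
        return pcm
    level = math.sqrt(sum(s * s for s in samples) / len(samples))
    if level >= threshold_rms:
        return pcm
    return bytes(len(pcm))  # same length, all zeros
```

Note that a gate like this only mutes quiet microphones; cancelling a TV that is louder than the speaker’s voice requires echo or noise cancellation, which is a different and much harder technique.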


Found this topic: “Home assistant voice preview edition weak mic?”. Until it becomes clear whether the issue is in the hardware or the software, HAVPE is a no-go for me, unfortunately.

Thank you very much for the pointer! It took some time, but I found it: “Is there a way to stream audio from one ESPHome to another?”. I think this is exactly the base for what I want.

For better understanding, I’ve drawn a diagram of my current goal, which I think will satisfy my needs:

Using a built-in speaker is not an option for me, so I’m planning something like a pair of Teufel Ultima 20 or similar in each room.

Since I don’t think I need syncing, I’m not sure it will help me. I think it might even be harmful for a conference scenario.

The jack is only for connecting a laptop. I hate wires, but the jack is used all the time because it’s the fastest option to plug in and unplug.

I don’t think I need studio quality (and studios require purity), but I’d like it to be on par with consumer-grade devices from Yamaha or at least Onkyo.

I think the price of the ESP32 - both initial cost and operating cost - is great. It seems like it would be much more expensive if it were based on an RPi.

Thanks! However, I need a server anyway for Home Assistant, storage, and some experiments. Therefore, it won’t incur any additional cost, except some extra power consumption for STT and a graphics card.

Linux based Voice Assistant
successor to wyoming satellite

Something like this could be centralized, running on Docker.
Individual mics/speakers could be wired back to a central location.

It may be possible to run a standard USB conference speaker/mic connected via a balun, maybe 150’ unpowered and more with a powered balun. I haven’t tested that, so I’m not sure about the lengths.
