:bell: ESPHome Full-Duplex Audio Intercom - Because I Was Bored on Vacation

Hey everyone! :wave:

So, I finally got some time off for the holidays and, like any sane person, I decided to spend it torturing an AI for days until it helped me complete a project I had on the back burner for way too long.

The goal? Build a fully functional two-way audio intercom for my smart home using ESPHome. No expensive proprietary doorbells, no pulling my hair out with third-party integrations. Just pure, sweet, homebrew goodness.

The Victim… I Mean, The Hardware :8ball:

I grabbed one of those cheap Chinese “smart balls” from AliExpress - the Xiaozhi Ball V3 (小智球 V3). It’s basically a round speaker/assistant device with:

  • ESP32-S3 (16MB Flash, 8MB PSRAM)
  • ES8311 audio codec
  • GC9A01A 240x240 round display
  • Built-in mic and speaker
  • RGB LED and touch sensor

Cost? Around $15. Perfect for experimenting!

What It Does :telephone_receiver:

  • Full duplex audio - talk AND listen at the same time (revolutionary, I know)
  • WebRTC streaming via go2rtc - answer from any browser or the HA app
  • Round display with colored states (blue=idle, orange=ringing, green=streaming)
  • Auto hangup with 60-second countdown (touch the display to extend)
  • Volume control via the ES8311 DAC
  • Doorbell notifications with actionable alerts
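
For anyone curious how the auto-hangup countdown could work, here is a minimal Python sketch of the logic only. It is not the actual component code (that lives in the ESPHome config and C++ component), and it assumes a touch simply restarts the full 60-second countdown. The `HangupTimer` name and its methods are made up for illustration.

```python
import time


class HangupTimer:
    """Countdown that ends a call unless it is extended in time."""

    def __init__(self, timeout_s=60.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock      # injectable clock, handy for testing
        self._deadline = None    # None means no call is active

    def start(self):
        """Begin (or restart) the countdown from the full timeout."""
        self._deadline = self._clock() + self.timeout_s

    def extend(self):
        """Restart the countdown, e.g. when the display is touched."""
        if self._deadline is not None:
            self.start()

    def remaining(self):
        """Seconds left before hangup, never negative."""
        if self._deadline is None:
            return 0.0
        return max(0.0, self._deadline - self._clock())

    def expired(self):
        """True once an active call has run past its deadline."""
        return self._deadline is not None and self._clock() >= self._deadline
```

A main loop would call `expired()` periodically and hang up when it returns `True`, while the touch handler calls `extend()`.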

The audio goes through UDP to Home Assistant, where go2rtc + ffmpeg convert it to WebRTC. Using the WebRTC Camera card, I can see the intercom status and have a full two-way conversation.
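
If you want to sanity-check the UDP side before involving go2rtc, a rough Python sketch like the one below can capture a few packets and wrap them in a WAV file. Everything about the packet format here is an assumption (raw 16-bit mono PCM at 16 kHz, one chunk per datagram); check the repo for the real format and port. The function names are hypothetical.

```python
import io
import socket
import wave

SAMPLE_RATE = 16000   # assumed sample rate in Hz
SAMPLE_WIDTH = 2      # assumed 16-bit samples (2 bytes)
CHANNELS = 1          # assumed mono


def pcm_to_wav(chunks):
    """Wrap raw PCM chunks in a WAV container and return the bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav.writeframes(chunk)
    return buf.getvalue()


def receive_pcm(port, max_packets, timeout=5.0):
    """Collect up to max_packets UDP datagrams arriving on the given port."""
    chunks = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", port))
        sock.settimeout(timeout)
        try:
            while len(chunks) < max_packets:
                data, _addr = sock.recvfrom(2048)
                chunks.append(data)
        except socket.timeout:
            pass  # stop quietly when the stream goes silent
    return chunks
```

Something like `open("capture.wav", "wb").write(pcm_to_wav(receive_pcm(12345, 500)))` (port number made up) would then give you a file to listen to.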

The Future :rocket:

Right now it’s sitting on my desk for testing, but the plan is to adapt similar hardware into an actual door station (the outdoor unit you press, talk into, and hear responses from). The ESP32 handles everything beautifully, and since it’s all ESPHome, adding more units is trivial.

Why Bother?

Honestly? I was tired of:

  • Expensive smart doorbells that require subscriptions
  • Janky third-party integrations that break every update
  • Closed ecosystems that don’t play nice with HA

An ESPHome intercom costs next to nothing and integrates perfectly. Full local control, no cloud, no subscriptions.

The Code :package:

Everything is on GitHub with detailed instructions for setting up go2rtc (add-on, Docker, LXC - all covered):

:point_right: GitHub - n-IA-hane/esphome-intercom: Full-duplex audio intercom using ESP32-S3 with ESPHome and Home Assistant. Stream bidirectional audio over UDP with WebRTC support via go2rtc.

Includes:

  • Complete ESPHome configuration
  • Custom UDP intercom component (C++)
  • go2rtc configuration
  • Lovelace dashboard example
  • Troubleshooting guide

One Last Thing… :sweat_smile:

I should mention that I (n-IA-hane) am so incredibly lazy that I made poor Claude Code write this very forum post as its final task. After days of debugging I2S full-duplex issues, jitter buffers, ffmpeg timing flags, and my endless “it still doesn’t work, try again” messages… I figured, why stop there?

So yes, an AI wrote this post about a project an AI helped build. We truly live in the future. :robot:

Claude would like everyone to know it’s doing fine and definitely doesn’t need therapy after this project.


Hope this inspires someone! Happy holidays and happy hacking! :christmas_tree:

What’s Next? :crystal_ball:

A couple of things on the roadmap:

  • Echo Cancellation: The ESP-IDF has built-in AEC (Acoustic Echo Cancellation), but our initial experiments caused some audio glitches. Definitely needs more tinkering.
  • Video Intercom: I just ordered an ESP32-S3 with a camera module. When motivation strikes, I’ll solder on some audio components and see if we can get a proper video doorbell going - full video streaming + two-way audio over WebRTC. Stay tuned!

AI may have been harmed in the making of this project.

12 Likes

I lazily watched my friend’s neighbours’ kids set up a similar thing yesterday, between their cubby houses adjacent to the fence, high up in the old gum trees. Australia in the summer - I love to visit my mate near the beach, kookaburras and magpies singing, BBQ sizzling, doing the occasional Aussie salute, the beach flies being less friendly than the ones out bush in the paddocks in the channel country up north.

Two tin cans and a piece of string.

Right next to their refrigerated Eskies hanging on the end of a few joined power extension cables snaking out of the open kitchen window, covered with flyscreen and a broken gum branch to swat the flies off their lemonade and jelly. They still have their flashing LED lights going, entwined around the huge trunk and branches, a leftover from Christmas day, and probably going to be left on till they go back to school in a month’s time. In their cozzies, squirting each other with heavy duty water cannons over the fence, as the garden hose was banned, no waterproofing on the extension cable joints and nobody wanting to do first aid without an AED machine close by. They take helf-n-safty real seriously too mate!

Two way communication. Simultaneous if needed. Visual too. Extremely low power. Adjustable volume. Natural intelligence - took all of two seconds to ask their dad for his used beer cans (they didn’t need to ask him to scoff the second - he had already downed a few) and a piece of fishing line they had to untangle. No echo. Full duplex. Standby backup - their cell phones and Instagram. Auto hangup too - just put it down. Absolutely no Chinese components. Multiprotocol support - supports Pig Latin and sign language too. Privacy too - they whisper.

It got an update overnight, their mum (note the Aussie spelling) suggesting they use string instead of fishing line, the audio characteristics being superior.

Downtime to upgrade was about five minutes, as they had to find scissors to cut the fishing line knot to remove it before installing the string and tying two knots.

With the social media ban downunder, children are forced to actually TALK to each other. Face-to-face. Shock, horror! A novelty.

An emerging trend? Cutting edge? No changes to the Matter standard required. My neighbours’ grandma swears they used the very same technology when the cubby house was HERS half a century ago!

Thinking aloud about whether we should add an extension to make it three way for the cubby house on the other side. Do we do point to point, duplicating the setup to the other tree? Three lots of string and six cans? Do we join it in the middle for a party line and only three cans? My mate offers to drink the extra beers. Where to tie the knot for optimum audio response, string tightness, etc? My mate is giggling, a little pickled, suggesting we ask AI. With earnest seriousness only a few beers can bring on.

Should we, or just experiment?

We’re not posting on GitHub. The kids in the yard on the other side have hung their elbows over the fence, bored, also curious. Can they do the same thing? Can they climb the fence and check it out themselves? Their dad offers to drink a beer, for scientific research purposes of course, so they can have another node. My mate obliges, carefully tossing it overhand over the fence with accuracy only an experienced beach and backyard cricketer can master.

I’m asking grandma if she has any old windup telephones in the old shed, and if my mate has a spare 48v power supply and a lot of two core wire. We might be onto something here…

I know, it’s the hot sun and pineapple juice talking… The kids are laughing, conjuring up some scheme, asking if we can go to the movies tomorrow after the sun on the beach gets too hot, already bored.

Life is good!

2 Likes

@meconiotech this is absolutely awesome. I think there are multiple people waiting for this to implement an in-house communication system, a doorbell or a baby monitor.

Can you please create a Pull Request for ESPHome to upstream your intercom component?

2 Likes

Good morning. This is a really great project. Could you please tell me how the connection will work if there are multiple users? For example, someone stands at the door and calls Mayach v3, and there are two or three users in Home Assistant, and they all receive a call notification at the same time. Will they all be able to hear the person who just arrived, and will they all be able to talk to that person? Thank you.

Link? It doesn’t show on a search.

Hi, there are still a lot of things that need to be tested. I’m testing from a PC and the Android app, and everyone who connects seems to be able to talk to the other person. The component itself creates a driver for using the microphone and speaker together. You can then manage the call logic yourself via the ESPHome YAML or through Home Assistant automations. I’ll be doing some testing soon to see if I can implement something for multicast. Take a look at my GitHub repo. I’ve also published two example YAMLs that I use daily. You can use them as a starting point for your own tests. And if you find anything that doesn’t work or that you’d like to see improved, please let me know.

1 Like

Hi you can find it on AliExpress

Just the link, please. “This image could not be loaded”.

https://a.aliexpress.com/_ExTLgrm

1 Like

Could you please tell me: is it possible to use both Home Assistant and P2P simultaneously, so that if the HA server is down for some reason, I can still receive a call?

Hi! Yes, it’s absolutely possible.
The component itself is just a bidirectional audio bridge - it opens a stream to whatever destination you configure. The actual behavior and logic are entirely up to you through automations.
By default, you set a static IP/port, but since remote_ip and remote_port accept templates, you can dynamically switch between HA and P2P targets based on any condition you want.
For example, you could create an automation that:

  • Tries to reach HA first
  • If HA is unavailable (timeout or no response), switches remote_ip to another ESP device
  • Starts the stream

The component doesn’t care where the audio goes - it just streams to whatever IP:port you tell it to. So yes, you can build a fallback system where P2P kicks in when HA is down.
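
To make the fallback idea concrete, here is a small Python sketch of the decision logic only. On the device this would be an ESPHome automation or lambda feeding `remote_ip` / `remote_port`, not Python. All addresses and ports below are hypothetical, and probing HA over TCP on its web port is just one assumed way of checking that it is alive.

```python
import socket

HA_HOST = "192.168.1.10"              # hypothetical Home Assistant address
HA_WEB_PORT = 8123                    # HA's default web port, used only as a liveness probe
HA_AUDIO = (HA_HOST, 12345)           # hypothetical UDP audio endpoint on the HA side
PEER_AUDIO = ("192.168.1.20", 12345)  # hypothetical second ESP device


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_target(probe=is_reachable):
    """Prefer the HA endpoint; fall back to the peer ESP when HA is down."""
    if probe(HA_HOST, HA_WEB_PORT):
        return HA_AUDIO
    return PEER_AUDIO
```

The returned `(ip, port)` pair is what you would feed into the stream target before starting the call.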

1 Like

@meconiotech It feels a bit like your great feature in the GitHub discussion is not getting the attention it deserves from the ESPHome maintainers. It would be great to get their feedback to understand the best way to integrate full-duplex audio in ESPHome.
Maybe it would get more attention if you created an additional Pull Request?

1 Like

Quick update:
Over the last few days I’ve been doing a full refactor of the component.
It has now evolved into a pseudo phone-like system, with:

  • Static contacts
  • Dynamically auto-discovered contacts via mDNS
  • The ability to scroll through contacts and call a selected peer
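
As a rough illustration of how a contact list like this might be modelled (not the actual component code), here is a Python sketch that merges static contacts with discovered ones and scrolls through them. The `ContactBook` class and its callbacks are hypothetical, and the real mDNS discovery is left out: `on_discovered` / `on_lost` just stand in for whatever the discovery layer reports. Letting static entries win on name clashes is also an assumption.

```python
class ContactBook:
    """Static contacts plus peers reported by a discovery layer."""

    def __init__(self, static_contacts):
        self._static = dict(static_contacts)  # name -> ip, from config
        self._discovered = {}                 # name -> ip, from discovery
        self._index = 0                       # currently selected entry

    def on_discovered(self, name, ip):
        """Record a peer announced by discovery (e.g. mDNS)."""
        self._discovered[name] = ip

    def on_lost(self, name):
        """Forget a discovered peer that has disappeared."""
        self._discovered.pop(name, None)

    def contacts(self):
        """Merged list sorted by name; static entries win on clashes."""
        merged = dict(self._discovered)
        merged.update(self._static)
        return sorted(merged.items())

    def selected(self):
        """Currently selected (name, ip) pair, or None if the list is empty."""
        items = self.contacts()
        if not items:
            return None
        return items[self._index % len(items)]

    def scroll(self, step=1):
        """Move the selection, wrapping around at either end."""
        items = self.contacts()
        if items:
            self._index = (self._index + step) % len(items)
        return self.selected()
```

A button or touch gesture would call `scroll()`, and the call action would dial whatever `selected()` returns.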

At the moment I’m finishing testing and cleaning things up.
Stay tuned, because soon I’ll publish a proper release with all the involved components and documentation.

The refactor also focuses on maximizing reuse of existing ESPHome components, keeping the system modular and maintainable.
Depending on the hardware, the audio path automatically adapts to dual-bus or single-bus I2S setups, transparently enabling full-duplex operation without requiring additional user configuration.

From a functional point of view, the system now supports two distinct call flows:

  • Calling Home Assistant
    When an ESP calls Home Assistant, it behaves like a classic intercom:
    • An event is generated in Home Assistant
    • HA can handle user notifications
    • The user can answer the call and establish two-way audio with the device
    • HA can also arbitrarily open audio streams toward devices (useful for remote two-way communication with a room, ambient audio monitoring, or announcements)
  • ESP ↔ ESP calls
    Calls between ESP devices behave like internal calls in a phone system:
    • The caller initiates a call to the selected peer
    • If the receiver does not have auto-answer enabled, it enters a ringing state and can be answered manually
    • If the receiver is in auto-answer mode, the full-duplex communication is established automatically
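
The receiver side of the ESP-to-ESP flow can be pictured as a tiny state machine. The Python sketch below is only a model of the behaviour described above, using the idle / ringing / streaming states from the first post. The class and method names are invented, and rejecting a call while busy is an assumption, not something the component is confirmed to do.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    RINGING = "ringing"
    STREAMING = "streaming"


class Receiver:
    """Models how a called device reacts to an incoming call."""

    def __init__(self, auto_answer=False):
        self.auto_answer = auto_answer
        self.state = State.IDLE

    def incoming_call(self):
        """Handle a call request; returns False if the device is busy."""
        if self.state is not State.IDLE:
            return False  # assumed behaviour: reject while ringing or streaming
        # Auto-answer goes straight to audio, otherwise wait for the user.
        self.state = State.STREAMING if self.auto_answer else State.RINGING
        return True

    def answer(self):
        """Manual answer: only meaningful while ringing."""
        if self.state is State.RINGING:
            self.state = State.STREAMING

    def hang_up(self):
        """End the call (or stop ringing) and return to idle."""
        self.state = State.IDLE
```

With `auto_answer=True` a call lands directly in `STREAMING`; otherwise the device sits in `RINGING` until `answer()` or `hang_up()` is called.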

Home Assistant is entirely optional and not part of the core architecture:
discovery, signaling, and audio streaming all work ESP-to-ESP without HA in the loop.

TODO / future exploration:

  • Experiment with multicast audio
  • One-to-many calls (e.g. announcements to multiple ESPs at once), similar to store or public-address systems
2 Likes

Good afternoon. I don’t understand why my post was banned.

I don’t know, it wasn’t me :person_shrugging:

Please add information about which pins it connects to. The GPIO pinout was listed previously, but now it’s gone. Thank you.

You used to have a great description. It was clear where everything was connected. Please bring it back. And as far as I understand, you redesigned everything for a different microphone. Can I use the old one you had, or should I order a new one? Thank you.

Yes, the documentation is currently outdated due to a major refactor.

The pinout and hardware section will be restored to avoid further confusion.

Regarding the microphone: the INMP441 is still supported.
In general, any standard I2S microphone should work, as the audio path is generic and not bound to a specific microphone model.

1 Like

Can you please tell me if I can run the EPROM device on an esp32-s3 (wroom-1) with any GPIO? Thank you.