ESPHome Full-Duplex Audio Intercom - PBX-lite, Voice Assistant, TCP/UDP, and too much audio debugging
Hey everyone!
Big update for the project.
The repository is here: https://github.com/n-IA-hane/esphome-intercom
The project started as a very small idea: I bought one of those cheap round Chinese ESP32-S3 “smart balls” from AliExpress, originally just because I wanted a simple
full-duplex doorbell / intercom.
Then scope creep did what scope creep does.
It grew from “press a button and talk to Home Assistant” into a full ESPHome audio stack with:
- full-duplex audio
- acoustic echo cancellation
- Voice Assistant and Micro Wake Word running on the same device
- TCP and UDP intercom transports
- a Home Assistant integration
- a Lovelace softphone card
- LVGL touchscreen UIs
- codec-aware volume handling
- and now a real PBX-lite call model
So the project still covers the simple use case, but the internal model is now much cleaner and more powerful.
## The important warning first
The next release, 2026.5.0, will break old YAMLs.
This is intentional.
The project had grown too much around old compatibility layers, old filenames, old protocol assumptions, and a few “it works, but it is not really clean” routing
tricks. The current dev branch is moving to a clearer model and cleaner YAML structure.
If you are already using the old versions, do not blindly flash the next release over your current setup. Read the README on the dev branch first and expect to
adapt your YAMLs.
The break is annoying, but it is worth it: the result is much easier to reason about, easier to extend, and much closer to what I wanted the project to become.
## The new mental model: every ESP is an extension
The old project behaved a bit like a PBX, but internally it was still using several shortcuts.
For example, Home Assistant could notice that an ESP went into an outgoing state, inspect the selected destination, and then connect it to another peer. It worked,
but the model was backwards: HA was doing too much interpretation.
The new model is simpler:
Every ESP is treated as an independent phone extension.
Each ESP can:
- originate a call
- receive a call
- answer
- decline
- hang up
- expose its current state
- expose its selected destination
- expose the last terminal reason
- be mirrored by the Home Assistant card
Home Assistant is also treated as a peer in the same system. It can originate and receive calls too, just like a softphone.
So this is no longer “PBX-like”.
It is a small PBX-lite system.
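
To make the extension idea concrete, here is a minimal Python sketch of one extension as a tiny state machine. It is not the project's firmware code: the class, method, and field names are hypothetical, and only the four call states and the idea of a selected destination and a last terminal reason come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class CallState(Enum):
    IDLE = "Idle"
    OUTGOING = "Outgoing"
    RINGING = "Ringing"
    STREAMING = "Streaming"


@dataclass
class Extension:
    """One callable peer: an ESP, or Home Assistant acting as a softphone."""

    name: str
    state: CallState = CallState.IDLE
    selected_destination: str = ""
    last_reason: str = ""

    def _require(self, expected: CallState) -> None:
        if self.state is not expected:
            raise RuntimeError(
                f"{self.name}: expected {expected.value}, got {self.state.value}"
            )

    def originate(self) -> None:
        """Start a call toward the currently selected destination."""
        self._require(CallState.IDLE)
        self.state = CallState.OUTGOING

    def receive(self) -> None:
        """An incoming call arrived: start ringing."""
        self._require(CallState.IDLE)
        self.state = CallState.RINGING

    def answer(self) -> None:
        """Answer a ringing call locally."""
        self._require(CallState.RINGING)
        self.state = CallState.STREAMING

    def remote_answered(self) -> None:
        """The far end picked up our outgoing call."""
        self._require(CallState.OUTGOING)
        self.state = CallState.STREAMING

    def end(self, reason: str) -> None:
        """Decline, hang up, or fail: go back to idle and keep the reason."""
        self.state = CallState.IDLE
        self.last_reason = reason


door = Extension("Door", selected_destination="Home")
home = Extension("Home")  # Home Assistant is just another extension
door.originate()
home.receive()
home.answer()
door.remote_answered()
home.end("Local hangup")
door.end("Remote hangup")
```

Both sides use the same class, which is the point of the model: the door ESP and the HA softphone differ only in name and selected destination.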
## You can still use it as a normal doorbell
Do not be scared by the PBX terminology.
If all you want is:
- one ESP at the door
- a Home Assistant dashboard card
- call / answer / hangup
- auto-answer if wanted
- full-duplex audio
that still works.
In that case the ESP has a very small phonebook, usually just the Home Assistant peer, and the card behaves like the softphone on the other side.
The bigger model is there so that the simple model is not a special hack anymore.
## How the card behaves now
The Lovelace card has two different behaviours depending on the selected destination.
### 1. The ESP's selected destination is another ESP
In this case the card mirrors the ESP.
If you press Call on the card, it triggers the ESP's real call action. The ESP originates the call. The card just reflects what the ESP is doing:
- Idle
- Outgoing
- Ringing
- Streaming
- Busy
- Declined
- Remote hangup
- Remote device lost
- DND
- Timeout
- Protocol / bridge errors
This matters because ESP-to-ESP calls may not pass through Home Assistant at all.
The card should not pretend to be the caller when the ESP is the caller. It should mirror the device.
### 2. The ESP's selected destination is Home Assistant
If the selected contact matches the Home Assistant instance name, for example Home or Office, then the card behaves like a softphone.
In that case Home Assistant originates the call toward the ESP, and the ESP sees an incoming call from the HA peer.
So:
- selected destination = another ESP -> card mirrors the ESP
- selected destination = HA -> card is the HA softphone
That distinction made the whole system much cleaner.
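
The rule is small enough to write down as code. This is only an illustrative Python sketch, not the card's actual implementation: the function name is hypothetical, and the exact-match comparison (after trimming whitespace) is an assumption about how the contact name is compared to the HA instance name.

```python
def card_role(selected_destination: str, ha_instance_name: str) -> str:
    """Return 'softphone' when the ESP targets HA itself, otherwise 'mirror'."""
    # Assumption: names are compared exactly after trimming whitespace.
    if selected_destination.strip() == ha_instance_name.strip():
        return "softphone"
    return "mirror"


role_for_ha = card_role("Home", "Home")
role_for_esp = card_role("Kitchen", "Home")
```

With an HA instance named Home, selecting Home gives the softphone behaviour and selecting any other contact gives the mirror behaviour.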
## TCP and UDP are both first-class transports
The project now has two real intercom transports:
- TCP for framed signaling + audio over one reliable connection
- UDP with raw PCM audio and a separate framed control port
Default ports:
- TCP signaling/audio: 6054
- UDP audio: 6054
- UDP control: 6055
TCP and UDP firmware can coexist on the same network.
If both peers use the same transport, they can communicate directly when the phonebook entry points to the peer.
If the transports differ, Home Assistant can bridge them.
Example:
- TCP ESP -> TCP ESP: direct possible
- UDP ESP -> UDP ESP: direct possible
- TCP ESP -> UDP ESP: through HA bridge
- UDP ESP -> TCP ESP: through HA bridge
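
The four cases above reduce to one comparison. A rough Python sketch, with a hypothetical function name and return labels; note that "direct" here only means direct is possible, since it still depends on the phonebook entry pointing at the peer:

```python
TRANSPORTS = {"tcp", "udp"}


def call_path(caller_transport: str, callee_transport: str) -> str:
    """Return 'direct' when both peers share a transport, else 'ha_bridge'."""
    for transport in (caller_transport, callee_transport):
        if transport not in TRANSPORTS:
            raise ValueError(f"unknown transport: {transport!r}")
    if caller_transport == callee_transport:
        return "direct"  # possible, if the phonebook entry points to the peer
    return "ha_bridge"  # Home Assistant bridges TCP <-> UDP


paths = {
    (a, b): call_path(a, b)
    for a in sorted(TRANSPORTS)
    for b in sorted(TRANSPORTS)
}
```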
## Unified phonebook
The phonebook is now protocol-aware.
Home Assistant publishes one logical roster through:
sensor.intercom_phonebook
Rows look like this:
Name|tcp|ip|tcp_port
Name|udp|ip|udp_audio_port|udp_control_port
Name|ha|ip|tcp_port|udp_audio_port|udp_control_port
The ESP firmware subscribes to that single phonebook and normalizes the entries locally for its own transport.
This replaces the older split TCP/UDP phonebook approach.
It also makes cross-protocol routing understandable: if a contact cannot be reached directly by the current firmware transport, the entry can point to Home Assistant
and preserve the real destination name in the signaling payload.
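
To show what "normalizes the entries locally" could look like, here is a small Python sketch that parses the three row shapes listed above. It is not the firmware's parser (that lives in the ESPHome component): the `Contact` type, the function names, and the example IP addresses are made up for illustration, and treating an `ha` row as usable from either transport is my reading of the row layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contact:
    name: str
    kind: str  # "tcp", "udp" or "ha"
    ip: str
    tcp_port: Optional[int] = None
    udp_audio_port: Optional[int] = None
    udp_control_port: Optional[int] = None


# Port fields that follow "Name|kind|ip" for each row kind.
_PORT_FIELDS = {
    "tcp": ("tcp_port",),
    "udp": ("udp_audio_port", "udp_control_port"),
    "ha": ("tcp_port", "udp_audio_port", "udp_control_port"),
}


def parse_row(row: str) -> Contact:
    """Parse one pipe-separated phonebook row into a Contact."""
    parts = row.strip().split("|")
    if len(parts) < 3:
        raise ValueError(f"row too short: {row!r}")
    name, kind, ip, *ports = parts
    if kind not in _PORT_FIELDS:
        raise ValueError(f"unknown row kind: {kind!r}")
    fields = _PORT_FIELDS[kind]
    if len(ports) != len(fields):
        raise ValueError(f"{kind} row needs {len(fields)} port(s): {row!r}")
    return Contact(name, kind, ip, **{f: int(p) for f, p in zip(fields, ports)})


def usable_directly(contact: Contact, my_transport: str) -> bool:
    """Assumption: 'ha' rows carry ports for both transports."""
    return contact.kind == "ha" or contact.kind == my_transport


kitchen = parse_row("Kitchen|udp|192.168.1.21|6054|6055")
home = parse_row("Home|ha|192.168.1.10|6054|6054|6055")
```

A TCP firmware would then keep `home` as directly reachable and treat `kitchen` as a contact that has to go through Home Assistant.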
## HA-IS-PBX mode
There is also an optional HA-IS-PBX routing mode.
Normally, if two ESPs can call each other directly, they do.
But sometimes you want every call to pass through Home Assistant:
- to keep call history visible in HA
- to bridge transports
- to forward calls
- to centralize routing rules
- to build automations around calls
With HA-IS-PBX enabled, the ESP delegates routing to Home Assistant even when it knows about the destination.
This is optional.
Direct ESP-to-ESP is still a core goal of the project.
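
As a sketch of the routing decision (hypothetical names, not the component's real API): the ESP connects to the peer directly only when HA-IS-PBX is off and the peer is directly reachable. Otherwise it connects to Home Assistant and keeps the real destination name in the payload.

```python
from typing import Dict, Set


def plan_call(
    destination: str,
    direct_peers: Set[str],
    ha_is_pbx: bool,
    ha_name: str = "Home",  # assumed HA instance name
) -> Dict[str, str]:
    """Pick the peer to connect to while preserving the real destination."""
    if not ha_is_pbx and destination in direct_peers:
        connect_to = destination
    else:
        connect_to = ha_name  # HA routes or bridges the call
    return {"connect_to": connect_to, "destination": destination}


peers = {"Kitchen"}
direct = plan_call("Kitchen", peers, ha_is_pbx=False)
via_pbx = plan_call("Kitchen", peers, ha_is_pbx=True)
via_bridge = plan_call("Garage", peers, ha_is_pbx=False)
```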
## ESP-side auto-discovery
A proper ESP-side discovery system is still planned.
The project already has the PBX-lite model needed for it, and the goal is that compatible ESPs on the same LAN can discover each other and become callable without HA
being required as the phonebook authority.
However, I am being careful here.
The previous mDNS browser implementation exposed timing/stability problems on real devices, especially around reconnection and Home Assistant restarts. So ESP-side
discovery is being redesigned and soak-tested before it becomes part of the standard YAMLs again.
For now, the stable dev baseline is:
- HA publishes the protocol-aware phonebook
- ESPs subscribe to it
- direct calls still work when the phonebook entry points directly to the peer
- ESP-side mDNS discovery will come back only when it is solid
## Call reasons now propagate
Terminal reasons are now part of the model.
If a remote ESP declines because Do Not Disturb is enabled, the caller should receive that reason and display it.
Examples:
- Local hangup
- Remote hangup
- Busy
- DND
- Timeout
- Remote device lost
- Unreachable
- Protocol error
- Bridge error
Internally these remain machine-readable, but the ESP and card expose user-facing reason text.
This is important because the system should behave like phones do: if the remote side is busy or in DND, the caller should know why the call ended.
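
One way to picture the split between machine-readable reasons and user-facing text is an enum plus a lookup table. This is a Python sketch under assumptions: the snake_case code strings are invented for the example (the real wire values are in the protocol docs), while the display texts are the ones listed above.

```python
from enum import Enum


class Reason(Enum):
    # Hypothetical machine-readable codes, not the real protocol values.
    LOCAL_HANGUP = "local_hangup"
    REMOTE_HANGUP = "remote_hangup"
    BUSY = "busy"
    DND = "dnd"
    TIMEOUT = "timeout"
    REMOTE_DEVICE_LOST = "remote_device_lost"
    UNREACHABLE = "unreachable"
    PROTOCOL_ERROR = "protocol_error"
    BRIDGE_ERROR = "bridge_error"


REASON_TEXT = {
    Reason.LOCAL_HANGUP: "Local hangup",
    Reason.REMOTE_HANGUP: "Remote hangup",
    Reason.BUSY: "Busy",
    Reason.DND: "DND",
    Reason.TIMEOUT: "Timeout",
    Reason.REMOTE_DEVICE_LOST: "Remote device lost",
    Reason.UNREACHABLE: "Unreachable",
    Reason.PROTOCOL_ERROR: "Protocol error",
    Reason.BRIDGE_ERROR: "Bridge error",
}


def reason_text(code: str) -> str:
    """Map a machine-readable code to the text shown on the ESP or card."""
    try:
        return REASON_TEXT[Reason(code)]
    except ValueError:
        return f"Unknown ({code})"


shown = reason_text("dnd")
```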
## Do Not Disturb
DND is now a native intercom control.
When enabled, an incoming call is declined with reason DND.
That reason travels back to the caller instead of being swallowed by Home Assistant.
Home Assistant should forward signaling, not consume it and invent a different story.
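
In code terms, the rule is "decline with a reason, relay it untouched". A minimal Python sketch with hypothetical function names and a made-up message shape:

```python
from typing import Dict, Optional


def on_incoming_call(dnd_enabled: bool) -> Dict[str, Optional[str]]:
    """Callee side: decline with reason DND when Do Not Disturb is on."""
    if dnd_enabled:
        return {"action": "decline", "reason": "DND"}
    return {"action": "ring", "reason": None}


def relay_to_caller(message: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Relay side (Home Assistant): forward the signaling as-is.

    The reason is copied through unchanged, never replaced by a generic one.
    """
    return dict(message)


seen_by_caller = relay_to_caller(on_incoming_call(dnd_enabled=True))
```

The caller ends up with the same reason the callee produced, which is what lets it display DND instead of a generic failure.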
## Audio stack
The audio side also changed a lot.
The project contains several ESPHome external components:
### intercom_api
The main intercom component:
- call state machine
- TCP/UDP signaling
- phonebook
- routing mode
- auto-answer
- DND
- call / answer / decline / hangup actions
- terminal reason reporting
### i2s_audio_duplex
A full-duplex I2S component for single-bus audio.
Standard ESPHome i2s_audio is great, but it does not cover the simultaneous mic + speaker single-bus cases this project needs.
This component is used for:
- codec boards such as ES8311 / ES7210 + ES8311
- MEMS mic + I2S amplifier boards
- full-duplex intercom
- Voice Assistant
- Micro Wake Word
- media playback
- echo reference generation
### esp_aec
A lighter ESP-SR acoustic echo cancellation wrapper.
This is the standard choice for:
- intercom-only builds
- generic full-experience builds
- single-mic boards where the full AFE pipeline is not needed
### esp_afe
A full Espressif AFE pipeline wrapper.
It is used where the extra frontend stages are worth the RAM/CPU cost. Those stages are:
- Echo Cancellation
- Noise Suppression
- Auto Gain Control
- Voice Activity Detector
- Speech Enhancement on dual-mic boards
The exposed controls are now named with full user-facing names instead of short internal labels.
Single-mic boards expose:
- Echo Cancellation
- Noise Suppression
- Auto Gain Control
- Voice Activity Detector
Dual-mic boards expose:
- Echo Cancellation
- Speech Enhancement
- Voice Activity Detector
### Codec-aware volume
Boards with an ES8311 codec use hardware DAC volume as the real master volume.
Generic boards without a codec use software speaker volume.
This avoids mixing up “media player volume”, “software speaker gain”, and “hardware DAC volume”, which are not the same thing.
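
The selection rule itself is simple. Here it is as a Python sketch: the function name and the two backend labels are hypothetical, and the real logic lives in the ESPHome YAML and components, not in Python.

```python
from typing import Iterable


def master_volume_backend(codecs: Iterable[str]) -> str:
    """Pick the single source of truth for master volume on a board."""
    names = {codec.strip().lower() for codec in codecs}
    if "es8311" in names:
        return "hardware_dac"  # the ES8311 DAC volume is the real master
    return "software_speaker"  # no codec: scale samples in software


codec_board = master_volume_backend(["ES7210", "ES8311"])
generic_board = master_volume_backend([])
```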
## Voice Assistant and Micro Wake Word
Full-experience YAMLs combine:
- intercom
- Voice Assistant
- Micro Wake Word
- AEC or AFE processing
- media player
- LVGL UI where the hardware has a display
The goal is still to make an ESPHome device that can act like a serious local voice/intercom endpoint, not just a toy demo.
There are different AEC modes depending on board class and available resources.
As a rule of thumb:
- intercom-only builds use the lighter VoIP-style AEC profile
- full-experience single-mic builds use SR AEC
- dual-mic full AFE boards use the AFE pipeline
- generic full-experience boards use esp_aec, not full esp_afe, to stay within realistic memory/flash limits
## Ready-to-use YAMLs
Current dev YAMLs are organized by use case and transport.
### Intercom only
yamls/intercom-only/single-bus/spotpear-ball-v2-intercom-tcp.yaml
yamls/intercom-only/single-bus/spotpear-ball-v2-intercom-udp.yaml
yamls/intercom-only/single-bus/generic-s3-intercom-tcp.yaml
yamls/intercom-only/single-bus/generic-s3-intercom-udp.yaml
### Full experience, AEC
yamls/full-experience/single-bus/aec/spotpear-ball-v2-full-aec-tcp.yaml
yamls/full-experience/single-bus/aec/generic-s3-full-aec-tcp.yaml
yamls/full-experience/single-bus/aec/generic-s3-full-aec-udp.yaml
yamls/full-experience/single-bus/aec/waveshare-p4-touch-full-aec-tcp.yaml
### Full experience, AFE
yamls/full-experience/single-bus/afe/spotpear-ball-v2-full-afe-tcp.yaml
yamls/full-experience/single-bus/afe/spotpear-ball-v2-full-afe-udp.yaml
yamls/full-experience/single-bus/afe/waveshare-s3-full-afe-tcp.yaml
yamls/full-experience/single-bus/afe/waveshare-s3-full-afe-udp.yaml
yamls/full-experience/single-bus/afe/waveshare-p4-touch-full-afe-tcp.yaml
yamls/full-experience/single-bus/afe/waveshare-p4-touch-full-afe-udp.yaml
### Experimental dual-bus
yamls/experimental/dual-bus/intercom-only/esp32-s3-mini-intercom.yaml
yamls/experimental/dual-bus/intercom-only/generic-s3-dual-intercom.yaml
yamls/experimental/dual-bus/full-experience/esp32-s3-mini-full.yaml
The tested focus right now is ESP32-S3 / ESP32-P4 with PSRAM and ESP-IDF.
## Tested hardware focus
The project currently focuses on:
- Spotpear Ball v2
- Waveshare ESP32-S3 Audio
- Waveshare ESP32-P4 Touch
- generic ESP32-S3 single-bus MEMS mic + I2S amplifier setups
- experimental dual-bus S3 setups
Hardware requirements depend on the YAML, but in general:
- ESP32-S3 or ESP32-P4
- PSRAM strongly recommended, and required for full builds
- ESP-IDF framework
- I2S mic and speaker path
- correct codec / pin configuration for your board
Generic YAMLs are examples, not magic. You still need to adapt pins and hardware details to your board.
## Home Assistant integration
The intercom_native custom integration provides:
- TCP listener
- UDP socket manager
- browser WebSocket audio for the Lovelace card
- protocol-aware phonebook sensor
- call / answer / decline / hangup / forward services
- transport-aware routing
- bridge logic for TCP <-> UDP
- call state and reason propagation
The Lovelace card acts as either:
- a mirror of the ESP, when the ESP is calling another ESP
- a softphone, when the ESP selected destination is Home Assistant
This distinction is one of the biggest usability improvements in the new model.
## Breaking changes recap
The next release moves to version:
2026.5.0
The project is adopting a Home Assistant-style calendar versioning scheme.
Breaking changes include:
- old YAML names replaced by TCP/UDP-specific filenames
- old Simple / Full mode removed
- PBX-lite is now the implicit product model
- mode: is optional and only used for raw UDP audio mode
- old split phonebook sensors replaced by one protocol-aware phonebook
- legacy ESPHome service wrappers for contact mutation removed
- old UDP helper component removed
- protocol framing changed
- MSG_STOP renamed to MSG_HANGUP
- MSG_DECLINE added
- HA and ESP firmware must be updated together
- old Xiaozhi naming replaced by Spotpear Ball v2 naming
- automations using old entity IDs may need updates
- Home Assistant integration now depends on zeroconf
- old YAMLs will need manual adaptation
Again: this is not a tiny patch release. It is a model cleanup.
## Why this was worth doing
The old project worked, but too many behaviours were emerging from side effects.
The new model gives the project a real structure:
- ESPs are extensions
- HA is an extension and optional PBX
- calls have source, destination, state and reason
- TCP and UDP are transports, not different products
- phonebook rows are protocol-aware
- card behaviour follows the selected destination
- DND, busy and hangup reasons propagate
- direct calls and HA-bridged calls can coexist
Once that model is in place, the project becomes easier to extend.
The next big piece is bringing ESP-side discovery back in a clean and stable way, so devices can discover peers directly without Home Assistant being the only
phonebook source.
## Useful docs
Before upgrading, read the dev branch documentation:
- README: https://github.com/n-IA-hane/esphome-intercom
- Architecture notes: docs/ARCHITECTURE.md
- Reference: docs/reference.md
- Deployment guide: docs/DEPLOYMENT_GUIDE.md
- Protocol docs: docs/INTERCOM_PROTOCOL.md
- Phonebook docs: docs/PHONEBOOK_PROTOCOL.md
## Small disclaimer
This project is still very much a real-world debugging battlefield.
Full-duplex I2S, ESP-SR, AEC reference paths, wake word timing, LVGL, Wi-Fi behaviour, PSRAM placement, codec registers, UDP/TCP signaling, and Home Assistant state
reflection all interact in fun and occasionally cursed ways.
I had a lot of help from advanced reasoning agents while developing and auditing the code, but every change still needs real hardware testing. If you try the dev branch, expect movement, expect breaking changes, and please report what hardware you tested.
:point_right: GitHub: https://github.com/n-IA-hane/esphome-intercom
Questions, testing reports, hardware donations, bug reports, and “why did you do this to yourself?” comments are welcome.
