Major development update: PBX-lite rewrite / 2026.5.0-dev
Hi everyone,
I am preparing a major development update for ESPHome Intercom.
This is not a small feature release. It is a semantic cleanup of the whole project, and I want to explain it clearly before it lands because it will affect existing
YAMLs.
The short version:
The next release will break existing YAMLs, but it is moving the project toward a much cleaner and more flexible architecture.
The project started as a way to make browser-to-ESP and ESP-to-Home-Assistant full-duplex intercom calls work. Over time, it grew into something much bigger:
ESP-to-ESP calls, touch UIs, Voice Assistant coexistence, AEC/AFE audio processing, TCP and UDP transport, routing, call states, and Home Assistant integration.
At that point the old model was no longer enough.
From PBX-like to PBX-lite
The project is moving from "PBX-like behavior" to a real PBX-lite model.
Every ESP is now treated as an independent extension.
That means an ESP can:
- originate a call
- receive a call
- ring
- answer
- decline
- hang up
- report why a call ended
- keep its own call state
- call another ESP directly when possible
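To make the "independent extension" idea concrete, here is a minimal sketch of a per-device call state machine. It is not the firmware code. The state names outgoing, ringing, and streaming come from this post; the idle state, the action names, and the Extension class are assumptions made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CallState(Enum):
    IDLE = "idle"            # assumed name for "no call in progress"
    OUTGOING = "outgoing"
    RINGING = "ringing"
    STREAMING = "streaming"


# Allowed (state, action) -> next state. Action names are hypothetical.
TRANSITIONS = {
    (CallState.IDLE, "originate"): CallState.OUTGOING,
    (CallState.IDLE, "incoming"): CallState.RINGING,
    (CallState.OUTGOING, "remote_answered"): CallState.STREAMING,
    (CallState.OUTGOING, "remote_declined"): CallState.IDLE,
    (CallState.OUTGOING, "hang_up"): CallState.IDLE,
    (CallState.RINGING, "answer"): CallState.STREAMING,
    (CallState.RINGING, "decline"): CallState.IDLE,
    (CallState.RINGING, "remote_hung_up"): CallState.IDLE,
    (CallState.STREAMING, "hang_up"): CallState.IDLE,
    (CallState.STREAMING, "remote_hung_up"): CallState.IDLE,
}


@dataclass
class Extension:
    """One ESP that keeps its own call state and last end reason."""
    name: str
    state: CallState = CallState.IDLE
    end_reason: str = ""

    def handle(self, action: str, reason: str = "") -> CallState:
        next_state = TRANSITIONS.get((self.state, action))
        if next_state is None:
            raise ValueError(f"{action!r} is not valid in state {self.state.value}")
        if next_state is CallState.IDLE:
            self.end_reason = reason   # e.g. "DND" reported by the callee
        elif self.state is CallState.IDLE:
            self.end_reason = ""       # a new call clears the previous reason
        self.state = next_state
        return next_state
```

For example, calling handle("originate") and then handle("remote_declined", "DND") on a fresh Extension leaves it idle with end_reason set to "DND". The point is that the state lives on the device, not in Home Assistant.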
Home Assistant is no longer treated as the hidden owner of every call. It becomes another member of the PBX system: it can call ESPs, receive calls from ESPs, and
optionally bridge calls between devices.
Do not be scared by the telephone terminology.
When I say PBX, extension, softphone, routing, or call state, I am not saying you must build a telephone system. You can still use this exactly like before: one ESP
at the door, one Home Assistant card, auto-answer if you want an intercom, manual answer if you want it to ring first.
The point of the new model is that the simple case becomes just one clean case of the bigger system, instead of being held together by special-case logic.
A one-device setup is simply a PBX-lite setup with one destination in the phonebook: Home Assistant.
Unified phonebook
The old split between separate TCP and UDP phonebooks is being replaced by a single protocol-aware phonebook.
The new logical format is:
Name|tcp|ip|tcp_port
Name|udp|ip|udp_audio_port|udp_control_port
Name|ha|ip|tcp_port|udp_audio_port|udp_control_port
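As a rough illustration of the three entry shapes above, a parser could look like the sketch below. This is not the project's implementation: the PhonebookEntry class and parse_entry function are hypothetical names, and the sketch only handles one entry per string because how entries are separated inside the roster is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhonebookEntry:
    name: str
    protocol: str                        # "tcp", "udp", or "ha"
    ip: str
    tcp_port: Optional[int] = None
    udp_audio_port: Optional[int] = None
    udp_control_port: Optional[int] = None


# Port fields that follow Name|protocol|ip, per protocol, in order.
_PORT_FIELDS = {
    "tcp": ("tcp_port",),
    "udp": ("udp_audio_port", "udp_control_port"),
    "ha": ("tcp_port", "udp_audio_port", "udp_control_port"),
}


def parse_entry(line: str) -> PhonebookEntry:
    """Parse one pipe-separated phonebook entry into a typed record."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) < 3:
        raise ValueError(f"too few fields: {line!r}")
    name, protocol, ip = parts[0], parts[1], parts[2]
    fields = _PORT_FIELDS.get(protocol)
    if fields is None:
        raise ValueError(f"unknown protocol {protocol!r} in {line!r}")
    ports = parts[3:]
    if len(ports) != len(fields):
        raise ValueError(f"{protocol} entry needs {len(fields)} port(s): {line!r}")
    # int() raises ValueError on a non-numeric port, which is what we want.
    return PhonebookEntry(name, protocol, ip,
                          **{f: int(p) for f, p in zip(fields, ports)})
```

With a made-up peer name and address, parse_entry("Kitchen|udp|192.168.1.21|6054|6055") yields an entry with udp_audio_port 6054 and udp_control_port 6055, and tcp_port left unset.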
Home Assistant publishes one roster:
sensor.intercom_phonebook
The ESP firmware subscribes to that single roster and locally shapes it for its own transport.
This means:
- TCP firmware keeps TCP peers direct.
- UDP firmware keeps UDP peers direct.
- TCP-to-UDP and UDP-to-TCP calls go through Home Assistant.
- The real destination name is preserved even when HA is used as a bridge.
So if a TCP ESP calls a UDP ESP, it does not pretend the destination is Home Assistant. The call still targets the real ESP, but the network endpoint points to HA
because HA must bridge between protocols.
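The shaping rule described above can be sketched as follows. This is only an illustration under stated assumptions: Peer, Route, and resolve_route are hypothetical names, the sketch only covers ESP peers, and ha_port stands for whatever port HA listens on for the caller's own transport.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    name: str
    protocol: str   # "tcp" or "udp"
    ip: str
    port: int       # port for this peer's own transport


@dataclass(frozen=True)
class Route:
    destination: str   # real callee name, preserved even when bridged
    ip: str            # network endpoint the caller actually connects to
    port: int
    bridged: bool


def resolve_route(local_protocol: str, peer: Peer,
                  ha_ip: str, ha_port: int) -> Route:
    """Keep same-protocol peers direct; send cross-protocol calls via HA."""
    if peer.protocol == local_protocol:
        return Route(peer.name, peer.ip, peer.port, bridged=False)
    # Cross-protocol: the target name stays the real ESP, only the
    # endpoint changes to Home Assistant, which bridges the two transports.
    return Route(peer.name, ha_ip, ha_port, bridged=True)
```

So a TCP device resolving a UDP peer named "Garage" gets a route whose destination is still "Garage", but whose ip and port belong to HA.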
Direct ESP-to-ESP calls
Same-protocol devices can call each other directly.
Examples:
- TCP ESP -> TCP ESP: direct call
- UDP ESP -> UDP ESP: direct call
- TCP ESP -> UDP ESP: bridged by Home Assistant
- UDP ESP -> TCP ESP: bridged by Home Assistant
This is the important architectural change: the ESP devices are real call endpoints, not just Home Assistant-controlled speakers/microphones.
One way or another, the project will also include ESP-side auto-discovery.
The goal is that if multiple intercom ESPs are on the same network, they should be able to discover each other, merge those peers into the local phonebook, and call
each other even before Home Assistant gets involved.
There is already work in progress around a generic ESPHome mdns_browser component for this. I am treating it as infrastructure, not as an intercom-only shortcut:
it should be able to browse arbitrary mDNS service types and let YAML/components decide how to use those discovered services.
For the current dev baseline, the stable source of truth is still the Home Assistant-published phonebook. ESP-side discovery will be enabled in the standard YAMLs
only after it has been validated long enough to avoid boot or reconnect instability.
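One possible way to merge discovered peers into the local phonebook is sketched below. The precedence rule is an assumption, chosen only because the HA-published phonebook is the current source of truth: on a name collision the HA entry wins. The Endpoint shape and merge_phonebook name are hypothetical, and the real discovery design is still in progress.

```python
from typing import Dict, Tuple

# Hypothetical endpoint shape: (protocol, ip, port)
Endpoint = Tuple[str, str, int]


def merge_phonebook(ha_roster: Dict[str, Endpoint],
                    discovered: Dict[str, Endpoint]) -> Dict[str, Endpoint]:
    """Combine both sources; HA-published entries override discovered ones."""
    merged = dict(discovered)   # start from locally discovered peers
    merged.update(ha_roster)    # assumed rule: HA entry wins on same name
    return merged
```

With this rule, a device that has not heard from Home Assistant yet still ends up with a usable phonebook built from discovery alone.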
Home Assistant as PBX
Home Assistant can now act in two ways.
First, Home Assistant can be a destination.
If the selected ESP destination is the Home Assistant instance name, for example Home, Office, or whatever you configured in HA settings, the Lovelace card
behaves like a softphone. Pressing Call means Home Assistant calls the ESP, and the ESP sees an incoming call from HA.
Second, Home Assistant can be a bridge.
This is useful when:
- two ESPs use different transports
- you want the call path to go through HA
- you want HA to keep visibility of a call
- you explicitly enable HA-as-PBX routing
This mode is optional. Direct peer-to-peer is still the cleanest path when both devices can talk directly.
Lovelace card behavior
The card now follows a simple rule.
If the selected ESP destination is another ESP, the card mirrors the ESP.
Pressing Call on the card presses Call on the ESP. The ESP originates the call. The card reflects the ESP state: outgoing, ringing, streaming, destination, caller,
and end reason.
If the selected ESP destination is Home Assistant, the card behaves like a softphone.
This makes the card less magical and more predictable. It mirrors the device when the ESP owns the call, and it becomes an HA endpoint only when HA itself is
selected.
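The card rule boils down to one comparison. The sketch below is written in Python for illustration only (the real card is frontend code), and CardBehavior and card_behavior are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardBehavior:
    mode: str         # "softphone" or "mirror"
    originator: str   # who places the call when Call is pressed: "ha" or "esp"


def card_behavior(selected_destination: str, ha_name: str) -> CardBehavior:
    """Decide how the card acts from the ESP's selected destination."""
    if selected_destination == ha_name:
        # HA itself is the destination: the card is an HA endpoint.
        return CardBehavior(mode="softphone", originator="ha")
    # Any other destination: the ESP owns the call and the card mirrors it.
    return CardBehavior(mode="mirror", originator="esp")
```

Here ha_name is the Home Assistant instance name configured in HA settings, for example Home or Office.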
There is also a show_protocol option so the card can show TCP/UDP context and distinguish normal ESP-to-ESP calls from inter-protocol calls.
Call reasons and Do Not Disturb
Call end reasons now travel through the protocol.
For example, if a device has Do Not Disturb enabled and another ESP calls it, the callee can send:
DND
The caller receives that reason and can show it on screen. Home Assistant and the Lovelace card preserve the reason instead of replacing it with a generic message.
Free-form reason strings are also supported. The PBX layer should forward the real reason, not invent a new one.
This matters because the ESP remains authoritative when the card is mirroring an ESP-to-ESP call. The card should show what the ESP reports, not reinterpret the call
as if HA owned it.
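The forwarding rule can be shown with a tiny sketch of a bridge relaying a call end. The BridgeLeg class and relay_call_end function are hypothetical and are not the integration's API. The only point is that the reason string passes through untouched.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BridgeLeg:
    """Hypothetical stand-in for the caller side of a bridged call."""
    name: str
    end_reasons: List[str] = field(default_factory=list)

    def send_end(self, reason: str) -> None:
        # A real bridge would write to the wire; here we just record it.
        self.end_reasons.append(reason)


def relay_call_end(caller_leg: BridgeLeg, reason_from_callee: str) -> None:
    """Forward the callee's end reason verbatim, without rewriting it."""
    caller_leg.send_end(reason_from_callee)
```

Whether the callee sends DND or a free-form string, the caller leg records exactly the text the callee reported.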
Audio stack cleanup
The audio stack is being cleaned up into clearer layers:
- i2s_audio_duplex: full-duplex I2S, speaker/mic paths, AEC reference capture, FIR decimation, runtime audio controls
- esp_aec: lightweight echo cancellation path
- esp_afe: full Espressif Audio Front-End path for advanced boards
- audio_processor: shared interface used by both AEC and AFE processors
Generic full-experience YAMLs are moving toward the lighter AEC path by default. Full AFE remains for boards where it makes sense and has been validated.
AFE user controls are now aligned with board topology.
Single-mic boards expose:
- Echo Cancellation
- Noise Suppression
- Auto Gain Control
- Voice Activity Detector
Dual-mic boards expose:
- Echo Cancellation
- Speech Enhancement
- Voice Activity Detector
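The topology mapping above is small enough to write as a lookup table. This sketch only restates the two lists; the afe_controls function and the idea of keying on a microphone count are illustrative assumptions, not how the components are configured.

```python
from typing import Tuple

# Controls exposed per microphone count, as listed in this post.
_AFE_CONTROLS = {
    1: ("Echo Cancellation", "Noise Suppression",
        "Auto Gain Control", "Voice Activity Detector"),
    2: ("Echo Cancellation", "Speech Enhancement",
        "Voice Activity Detector"),
}


def afe_controls(mic_count: int) -> Tuple[str, ...]:
    """Return the AFE user controls for a single-mic or dual-mic board."""
    try:
        return _AFE_CONTROLS[mic_count]
    except KeyError:
        raise ValueError(f"unsupported mic count: {mic_count}") from None
```

Echo Cancellation and Voice Activity Detector are common to both topologies. The rest of the set depends on how many microphones the board has.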
Codec boards use hardware Master Volume through the codec path. Generic/no-codec boards can use software speaker volume. This avoids confusing media-player volume,
intercom volume, and codec DAC volume.
Voice Assistant coexistence
The full-experience YAMLs are still meant to combine intercom, full-duplex audio, media playback, wake word, and Home Assistant Voice Assistant on the same device.
The target is not "either intercom or voice assistant".
The target is one ESP device that can be:
- an intercom endpoint
- a Voice Assistant satellite
- a media playback endpoint
- a PBX-lite extension
This is why the audio stack had to be cleaned up. Intercom audio, media playback, wake word, AEC/AFE processing, and VA all need to share the same hardware without
stepping on each other.
TCP and UDP are both first-class
TCP is still the preferred reliable signaling/audio transport for many setups.
UDP remains important for simple low-latency direct ESP use cases and for users who want lightweight ESP-to-ESP behavior.
The point of the new architecture is not to kill one transport. The point is to make both transports fit into the same call model.
That is why the phonebook now carries the protocol, and why Home Assistant can bridge only when needed.
Breaking changes
The next release will break existing YAMLs.
Main changes:
- versioning moves to calver: 2026.5.0
- firmware and HA integration must be updated together
- the wire protocol changed
- PBX-lite becomes the default product model
- the old simple/full split is gone
- raw_udp remains as an explicit opt-in for raw audio use cases
- TCP and UDP variants are explicit YAML files
- the old split TCP/UDP phonebook model is replaced by a unified phonebook
- UDP control uses port 6055
- UDP audio and TCP still default to 6054
- old UDP-specific intercom code is gone; UDP now lives inside intercom_api
- old Home Assistant phonebook push/service patterns are being replaced by the unified phonebook subscription model
- some device/entity naming is being cleaned up, so Home Assistant entities and automations may need to be renamed
Please do not update blindly when this lands.
Read the dev branch documentation first.
Documentation to read before upgrading
The most important docs are:
- README.md
- docs/INTERCOM_PROTOCOL.md
- docs/PHONEBOOK_PROTOCOL.md
- docs/DEPLOYMENT_GUIDE.md
- component READMEs under esphome/components/
The goal is to make the project easier to reason about long-term.
This may be uncomfortable during migration, but it gives the project a real foundation: independent ESP extensions, Home Assistant as an optional PBX member,
protocol-aware routing, cleaner audio layers, and a phonebook model that can scale beyond the original doorbell use case.
Current status
This is still active development.
The stable baseline right now is:
- Home Assistant publishes the unified phonebook.
- ESPs subscribe to it.
- ESP-to-ESP calls work when the phonebook gives them a compatible endpoint.
- Home Assistant bridges when transport/protocol differences require it.
- ESP-side mDNS discovery is being redesigned before it returns to the standard YAMLs.
So yes, the project is temporarily more disruptive, but the reason is to stop accumulating special cases and move toward a model that can scale.
The old system worked, but it was becoming difficult to explain and maintain.
The new system is stricter, but much easier to reason about.