🔔 ESPHome Full-Duplex Audio Intercom

As soon as I disable the mic, the speaker works as expected.
With the mic enabled, words are sometimes missing or the audio sounds choppy.
It doesn't matter whether AEC is on or off.
The mic works fine all the time.
It feels like the device can't handle the mic and speaker at the same time.

esphome:
  name: "jc4880p443"
  friendly_name: JC4880P443
  on_boot:
    priority: 600
    then:
      - output.turn_on: speaker_enable
      - delay: 1s
      # Configure ES8311 for digital feedback AEC
      # Register 0x44 bits[6:4] = ADCDAT_SEL: 4 = DACL + ADC
      # This makes ASDOUT output stereo: L=DAC loopback (reference), R=ADC (mic)
      - lambda: |-
          ESP_LOGI("es8311", "Configuring ES8311 register 0x44 for digital feedback...");
          uint8_t reg = 0x44;
          uint8_t current_val = 0;
          id(i2c_bus).write(0x18, &reg, 1);
          id(i2c_bus).read(0x18, &current_val, 1);
          ESP_LOGI("es8311", "Register 0x44 current value: 0x%02X", current_val);
          uint8_t data[2] = {0x44, 0x48};
          auto err = id(i2c_bus).write(0x18, data, 2);
          if (err == esphome::i2c::ERROR_OK) {
            ESP_LOGI("es8311", "Wrote register 0x44=0x48 for digital feedback AEC");
            id(i2c_bus).write(0x18, &reg, 1);
            id(i2c_bus).read(0x18, &current_val, 1);
            ESP_LOGI("es8311", "Register 0x44 after write: 0x%02X", current_val);
          } else {
            ESP_LOGE("es8311", "Failed to write register 0x44: error %d", (int)err);
          }
      # Restore ES8311 volume and sync AEC reference
      - lambda: |-
          float vol = 0.15 + (id(speaker_volume).state / 100.0) * 0.60;
          id(es8311_dac).set_volume(vol);
          id(i2s_duplex).set_aec_reference_volume(vol);
          id(peer_name) = id(intercom).get_current_destination();


  

logger:
  hardware_uart: USB_SERIAL_JTAG
  level: DEBUG
  logs:
    intercom_api: INFO
    i2s_duplex: DEBUG
    component: INFO
    wifi: INFO
    api: INFO
    ota: INFO
    mdns: INFO
    sensor: INFO
    switch: INFO
    light: INFO
    display: INFO
    image: INFO
    animation: INFO
    spi: INFO
    i2c: INFO
    esp32: INFO

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
  fast_connect: true
  post_connect_roaming: false


time:
  - platform: sntp
    id: my_time
    timezone: Europe/Rome
    servers:
      - 0.pool.ntp.org
      - 1.pool.ntp.org
      - 2.pool.ntp.org


ota:
  - platform: esphome

# =============================================================================
# CONNECTIVITY
# =============================================================================
api:
  on_client_connected:
    - lambda: |-
        static bool published = false;
        if (!published) {
          published = true;
          id(intercom).publish_entity_states();
        }


font:
  - file: "gfonts://Montserrat"
    id: montserrat_28
    size: 28

output:
  - id: gpio_backlight_pwm
    platform: ledc
    pin: 23
  - platform: gpio
    id: speaker_enable
    pin: GPIO11

light:
  - id: backlight
    name: Backlight
    platform: monochromatic
    output: gpio_backlight_pwm
    restore_mode: ALWAYS_ON




binary_sensor:
  - platform: status
    name: Status







# =============================================================================
# GLOBALS
# =============================================================================
globals:
  - id: init_in_progress
    type: bool
    restore_value: false
    initial_value: "true"


  - id: text_page_index
    type: int
    restore_value: false
    initial_value: "0"

  - id: text_pages
    type: std::vector<std::vector<std::string>>
    restore_value: false

  - id: global_is_timer_active
    type: bool
    restore_value: false

  - id: global_is_timer
    type: bool
    restore_value: false

  # Ping-pong animation direction
  - id: anim_direction
    type: bool
    restore_value: false
    initial_value: "true"

  # Geometry cache (precomputed for circular display)
  - id: x_metrics
    type: std::vector<int>
    restore_value: false
  - id: y_metrics
    type: std::vector<int>
    restore_value: false
  - id: chord_widths_cache
    type: std::vector<int>
    restore_value: false

  # Mode: 0=VA, 1=Intercom
  - id: current_mode
    type: int
    restore_value: no
    initial_value: "0"

  # Previous mode (for restoring after incoming call)
  - id: previous_mode
    type: int
    restore_value: no
    initial_value: "0"

  # Intercom peer name for display
  - id: peer_name
    type: std::string
    restore_value: no
    initial_value: '"Home Assistant"'

esp32:
  board: esp32-p4-evboard
  #cpu_frequency: 360MHz
  flash_size: 16MB
  framework:
    type: esp-idf
    sdkconfig_options:
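      # NOTE: the three CONFIG_ESP32S3_* options below are ESP32-S3 Kconfig symbols,
      # probably carried over from an S3 config; they likely have no effect on the ESP32-P4.
      CONFIG_ESP32S3_DEFAULT_CPU_FREQ_240: "y"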
      CONFIG_ESP32S3_DATA_CACHE_64KB: "y"
      CONFIG_ESP32S3_DATA_CACHE_LINE_64B: "y"
      CONFIG_FREERTOS_HZ: "1000"
      CONFIG_ESP32_WIFI_TASK_PINNED_TO_CORE_1: "y"
      CONFIG_LWIP_MAX_SOCKETS: "16"
      # Use dynamic TLS buffers to reduce peak memory usage for SSL connections
      CONFIG_MBEDTLS_DYNAMIC_BUFFER: "y"
      CONFIG_MBEDTLS_DYNAMIC_FREE_PEER_CERT: "y"
      CONFIG_MBEDTLS_DYNAMIC_FREE_CONFIG_DATA: "y"
      # Smaller TLS buffers (default 16384 is too much with all components active)
      CONFIG_MBEDTLS_SSL_IN_CONTENT_LEN: "8192"
      CONFIG_MBEDTLS_SSL_OUT_CONTENT_LEN: "4096"


# ==============================================================================
# EXTERNAL COMPONENTS
# ==============================================================================
external_components:
  - source:
      type: local
      path: esphome_components
    components: [intercom_api, i2s_audio_duplex, esp_aec]

# ==============================================================================
# AEC (Acoustic Echo Cancellation)
# ==============================================================================
esp_aec:
  id: aec_component
  sample_rate: 16000
  filter_length: 4     # 64ms echo tail (sufficient for integrated codec)
  mode: voip_low_cost   # Lightest mode, same quality as high_perf on ESP32-S3


i2s_audio_duplex:
  id: i2s_duplex
  i2s_lrclk_pin: GPIO10
  i2s_bclk_pin: GPIO12
  i2s_mclk_pin: GPIO13
  i2s_din_pin: GPIO48
  i2s_dout_pin: GPIO9
  sample_rate: 16000
  aec_id: aec_component
  # ES8311 digital feedback: RX is stereo L=DAC(reference), R=ADC(mic).
  # Sample-accurate reference alignment, no ring buffer delay needed.
  # Requires ES8311 register 0x44 bits[6:4]=4 (configured in on_boot via I2C).
  use_stereo_aec_reference: true
  aec_reference_delay_ms: 10   # Minimal (sample-aligned via stereo feedback)


# ==============================================================================
# MICROPHONE (via duplex platform)
# ==============================================================================
microphone:
  - platform: i2s_audio_duplex
    id: mic_aec
    i2s_audio_duplex_id: i2s_duplex
  - platform: i2s_audio_duplex
    id: mic_raw
    i2s_audio_duplex_id: i2s_duplex
    pre_aec: true

# =============================================================================
# SPEAKERS (mixer topology: VA + Intercom -> hw_speaker)
# =============================================================================
speaker:
  - platform: i2s_audio_duplex
    id: hw_speaker
    i2s_audio_duplex_id: i2s_duplex

  - platform: mixer
    id: audio_mixer
    output_speaker: hw_speaker
    num_channels: 1
    source_speakers:
      - id: va_speaker
        timeout: 10s
      - id: intercom_speaker
        timeout: 10s

# =============================================================================
# MEDIA PLAYER (VA TTS output through va_speaker)
# =============================================================================
media_player:
  - platform: speaker
    name: None
    id: speaker_media_player
    volume_min: 0.0
    volume_max: 0.8
    announcement_pipeline:
      speaker: va_speaker
      format: FLAC
      sample_rate: 16000
      num_channels: 1
    files:
      - id: timer_finished_sound
        file: https://github.com/esphome/home-assistant-voice-pe/raw/dev/sounds/timer_finished.flac



# ==============================================================================
# INTERCOM API (TCP-based, port 6054)
# ==============================================================================
# Auto-creates these sensors:
#   - text_sensor: intercom_state (Idle/Ringing/Streaming)
#   - text_sensor: destination (selected contact) [full mode only]
#   - text_sensor: caller (who is calling) [full mode only]
#   - text_sensor: contacts (count) [full mode only]

intercom_api:
  id: intercom
  mode: full
  microphone: mic_aec
  speaker: intercom_speaker
  ringing_timeout: 30s

# === FSM event callbacks ===
  on_incoming_call:
    - logger.log: "Incoming call"

  on_outgoing_call:
    # Fire HA event when calling "Home Assistant" (for notifications/automations)
    - if:
        condition:
          lambda: 'return id(intercom).get_current_destination() == "Home Assistant";'
        then:
          - homeassistant.event:
              event: esphome.intercom_call
              data:
                caller: !lambda 'return App.get_friendly_name();'
                destination: "Home Assistant"
                type: "doorbell"


  on_answered:
    - logger.log: "Call answered"

  on_streaming:
    - lambda: |-
        std::string caller = id(intercom).get_caller();
        if (!caller.empty()) {
          id(peer_name) = caller;
        } else {
          id(peer_name) = id(intercom).get_current_destination();
        }
    #- output.turn_on: speaker_enable

  on_idle:
    - lambda: 'id(peer_name) = id(intercom).get_current_destination();'
    #- output.turn_off: speaker_enable
    # Restore previous mode after call ends


  on_hangup:
    - logger.log:
        format: "Hangup: %s"
        args: ['reason.c_str()']

  on_call_failed:
    - logger.log:
        format: "Call failed: %s"
        args: ['reason.c_str()']

# ==============================================================================
# BUTTONS
# ==============================================================================
button:
  # Smart Call button: idle→call, ringing→answer, streaming→hangup
  # The on_outgoing_call callback handles the HA event for doorbell notifications
  - platform: template
    id: call_button
    name: "Call"
    icon: "mdi:phone"
    on_press:
      - intercom_api.call_toggle:
          id: intercom

  # Next contact (full mode)
  - platform: template
    id: next_contact_button
    name: "Next Contact"
    icon: "mdi:arrow-right"
    on_press:
      - intercom_api.next_contact:
          id: intercom

  # Previous contact (full mode)
  - platform: template
    id: prev_contact_button
    name: "Previous Contact"
    icon: "mdi:arrow-left"
    on_press:
      - intercom_api.prev_contact:
          id: intercom

  # Decline incoming call
  - platform: template
    id: decline_button
    name: "Decline"
    icon: "mdi:phone-hangup"
    on_press:
      - intercom_api.decline_call:
          id: intercom

  - platform: template
    id: refresh_contacts_button
    name: "Refresh Contacts"
    icon: "mdi:refresh"
    entity_category: config
    on_press:
      - intercom_api.set_contacts:
          id: intercom
          contacts_csv: !lambda 'return id(ha_active_devices).state;'

  - platform: restart
    name: "Restart"
    icon: "mdi:restart"

# =============================================================================
# SWITCHES AND SELECTS
# =============================================================================
switch:
  - platform: template
    name: Mute
    id: mute
    icon: "mdi:microphone-off"
    optimistic: true
    restore_mode: RESTORE_DEFAULT_OFF
    entity_category: config
    on_turn_off:
      - microphone.unmute:
          id: mic_aec
      - microphone.unmute:
          id: mic_raw
    on_turn_on:
      - microphone.mute:
          id: mic_aec
      - microphone.mute:
          id: mic_raw

  - platform: template
    id: timer_ringing
    optimistic: true
    internal: true
    restore_mode: ALWAYS_OFF
    on_turn_off:
      - lambda: |-
              id(speaker_media_player)
                ->make_call()
                .set_command(media_player::MediaPlayerCommand::MEDIA_PLAYER_COMMAND_REPEAT_OFF)
                .set_announcement(true)
                .perform();
              id(speaker_media_player)->set_playlist_delay_ms(speaker::AudioPipelineType::ANNOUNCEMENT, 0);
      - media_player.stop:
          announcement: true
    on_turn_on:
      - lambda: |-
            id(speaker_media_player)
              ->make_call()
              .set_command(media_player::MediaPlayerCommand::MEDIA_PLAYER_COMMAND_REPEAT_ONE)
              .set_announcement(true)
              .perform();
            id(speaker_media_player)->set_playlist_delay_ms(speaker::AudioPipelineType::ANNOUNCEMENT, 1000);
      - media_player.speaker.play_on_device_media_file:
          media_file: timer_finished_sound
          announcement: true
      - delay: 15min
      - switch.turn_off: timer_ringing

  # Intercom switches
  - platform: intercom_api
    intercom_api_id: intercom
    auto_answer:
      id: auto_answer_switch
      name: "Auto Answer"
      restore_mode: RESTORE_DEFAULT_OFF

  - platform: i2s_audio_duplex
    i2s_audio_duplex_id: i2s_duplex
    aec:
      id: aec_switch
      name: "Echo Cancellation"
      restore_mode: RESTORE_DEFAULT_ON



# =============================================================================
# NUMBERS
# =============================================================================
number:
  - platform: intercom_api
    intercom_api_id: intercom
    mic_gain:
      id: mic_gain
      name: "Mic Gain"

  - platform: template
    id: speaker_volume
    name: "Speaker Volume"
    icon: "mdi:volume-high"
    min_value: 0
    max_value: 80
    step: 5
    initial_value: 80
    optimistic: true
    restore_value: true
    unit_of_measurement: "%"
    set_action:
      - lambda: |-
          float es8311_vol = 0.15 + (x / 100.0) * 0.60;
          id(es8311_dac).set_volume(es8311_vol);
          id(i2s_duplex).set_aec_reference_volume(es8311_vol);




  - platform: template
    name: Screen timeout
    optimistic: true
    id: display_timeout
    unit_of_measurement: "m"
    initial_value: 5 #minutes
    restore_value: true
    min_value: 0 #0 is no timeout
    max_value: 99
    step: 1
    mode: box
# ==============================================================================
# TEXT SENSORS
# ==============================================================================
text_sensor:
  # Subscribe to HA's centralized contacts sensor
  - platform: homeassistant
    id: ha_active_devices
    entity_id: sensor.intercom_active_devices
    on_value:
      - intercom_api.set_contacts:
          id: intercom
          contacts_csv: !lambda 'return x;'


  - platform: wifi_info
    ip_address:
      name: IP Address
      entity_category: diagnostic
    ssid:
      name: Connected SSID
      entity_category: diagnostic
    mac_address:
      name: Mac Address
      entity_category: diagnostic
# ==============================================================================
# DIAGNOSTICS
# ==============================================================================
sensor:


  - platform: uptime
    name: "Uptime"
    update_interval: 60s

  - platform: internal_temperature
    name: "CPU Temperature"
    update_interval: 60s



  - id: wifi_signal_db
    name: WiFi Signal
    platform: wifi_signal
    update_interval: 60s
    entity_category: diagnostic

  - id: wifi_signal_strength
    name: WiFi Strength
    platform: copy
    source_id: wifi_signal_db
    filters:
      - lambda: return min(max(2 * (x + 100.0), 0.0), 100.0);
    unit_of_measurement: "%"
    entity_category: diagnostic


display:
  - platform: mipi_dsi
    id: device_display
    model: JC4880P443
    byte_order: little_endian
    rotation: 90
    lambda: |-
      it.fill(Color::BLACK);
      it.print(340, 100, id(montserrat_28), Color(0,255,0), TextAlign::LEFT, "Hello World1");
      it.print(340, 200, id(montserrat_28), Color::WHITE, TextAlign::LEFT, "Hello World2");
      it.print(340, 300, id(montserrat_28), Color(255,0,0), TextAlign::LEFT, "Hello World3");
touchscreen:
  platform: gt911
  i2c_id: i2c_bus
  id: device_touchscreen
  reset_pin: GPIO3
  update_interval: 100ms
  transform: #This is for 90 degree display rotation
    swap_xy: true
    mirror_x: false
    mirror_y: true
  on_update:
    then:
      - lambda: |-
          if (touches.size() > 0) {
            auto touch = touches[0];
            ESP_LOGI("TOUCH", "X=%d Y=%d", touch.x, touch.y);
          }
esp_ldo:
  - channel: 3
    voltage: 2.5V

psram:
  mode: hex
  speed: 200MHz

preferences:
  flash_write_interval: 5min

esp32_hosted:
  variant: ESP32C6
  reset_pin: GPIO54
  cmd_pin: GPIO19
  clk_pin: GPIO18
  d0_pin: GPIO14
  d1_pin: GPIO15
  d2_pin: GPIO16
  d3_pin: GPIO17
  active_high: true

i2c:
  id: i2c_bus
  sda: 7
  scl: 8
  scan: false
  frequency: 400kHz

# =============================================================================
# AUDIO CODEC (ES8311)
# =============================================================================
audio_dac:
  - platform: es8311
    id: es8311_dac
    bits_per_sample: 16bit
    sample_rate: 16000
    mic_gain: 24DB
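
The on_boot lambda above writes the whole byte 0x48 to register 0x44, while the comments only describe bits[6:4]. A minimal sketch of how that field can be read and changed without touching the other bits is below. The field position and the value 0x48 come from the config; the helper names are hypothetical, and what the remaining bits (including bit 3, which 0x48 also sets) control is not stated in this thread.

```python
# Sketch: pack and unpack the ADCDAT_SEL field (bits 6:4 of register 0x44).
# Helper names are hypothetical; only the field position and 0x48 come from the post.

ADCDAT_SEL_SHIFT = 4
ADCDAT_SEL_MASK = 0b111 << ADCDAT_SEL_SHIFT  # bits 6:4 -> 0x70


def get_adcdat_sel(reg_value: int) -> int:
    """Extract bits 6:4 from an 8-bit register value."""
    return (reg_value & ADCDAT_SEL_MASK) >> ADCDAT_SEL_SHIFT


def set_adcdat_sel(reg_value: int, sel: int) -> int:
    """Return reg_value with bits 6:4 replaced by sel, other bits untouched."""
    if not 0 <= sel <= 7:
        raise ValueError("sel must fit in 3 bits")
    return (reg_value & ~ADCDAT_SEL_MASK & 0xFF) | (sel << ADCDAT_SEL_SHIFT)


if __name__ == "__main__":
    # Field value contained in the byte the boot lambda writes.
    print(get_adcdat_sel(0x48))
    # Read-modify-write starting from a hypothetical current value of 0x08.
    print(hex(set_adcdat_sel(0x08, 4)))
```

In the lambda, the same idea would mean using the value read back from the register as the starting point instead of a fixed constant.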

Unfortunately, I don’t know how to help you this time. I’ve never touched the P4 ESPs. I’ve followed best practices as much as possible for the components, but since I’ve never tested the hardware in question, I have no idea how it behaves. Lately, I’ve been analyzing the behavior under the hood by observing the ESP status via JTAG. Having one of these boards might help me with my tests.

Hi,
it's not your fault.
After some investigation I found the culprit, because the old device suddenly started to have the same problems.
Some browsers always use the default mic; others let you choose.
My default mic is NVIDIA Broadcast, and that caused the problems.
After using the real mic as input instead of the one filtered by NVIDIA Broadcast, the problems were gone.
Anyway, the browser where it seems to work best is Chrome.
But the internal antenna of the JC4880P443C-I-W-Y is not as good as the one on the ESP32-S3 devboard.
Using an external antenna, which seems to be possible, should help.

I was thinking of getting one of these to do some testing with P4

It looks cool, it even has a chip that provides AEC reference and camera support. You know, I gave up on the camera with the S3. With the ESP IDF, the camera works for a while, then something crashes the ESP. I have several S3 Cams, and I have this problem with all of them. And I don’t understand what causes it.

Yes, it's cool, but much more expensive than the JC4880P443C-I-W-Y.
With coins and a voucher I got it for €29.30 on AliExpress.
It's the version with case and camera.
Plus an external speaker for ~€3.50.
But I haven't found a way to get the camera working under ESPHome so far.

I finally ordered it so I can start testing P4 too. Anyway, now that you’ve solved the browser capture problem, can you confirm that the component works well with P4?

Compared to the ESP32-S3, it has a weaker WiFi connection with the built-in antenna.
For me it's around -75 dBm at my desk…
The ESP32-S3 gets around -65 to -70 dBm.
Both are lying side by side.
I guess that’s the reason for the noticeable delay on the P4.
The sound itself is better: crisp and clear.
The echo cancellation is better too.
From my limited testing I would say yes.
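
For reference, the "WiFi Strength" copy sensor in the config earlier in the thread turns these RSSI readings into a percentage with a clamped linear map. A small sketch of that same formula, with a hypothetical function name, makes the difference between the two boards easier to compare:

```python
def rssi_to_percent(rssi_dbm: float) -> float:
    """Same mapping as the sensor filter lambda: 2 * (x + 100), clamped to 0..100."""
    return min(max(2 * (rssi_dbm + 100.0), 0.0), 100.0)


if __name__ == "__main__":
    # Readings mentioned in the post: P4 around -75, S3 around -65 to -70.
    for rssi in (-75.0, -70.0, -65.0):
        print(rssi, rssi_to_percent(rssi))
```

With this mapping, every 5 dB of signal corresponds to 10 percentage points, and anything at or below -100 dBm reads as 0%.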

But the P4 has stability problems:

2 random reboots while lying on the desk.

The performance of the camera in the preinstalled demo was very nice. It will be a good device at a reasonable price for intercom use, once ESPHome runs stably on it and camera support is added.

If anyone is interested, I was able to get it working on the Home Assistant Voice PE hardware. I’m sure it can be optimized further, and I haven’t played around with adding additional functionality to the button presses, but intercom calls seem to be working pretty well so far. A lot of the issues with things like mixing the audio were already handled in the core Voice PE code, so I was able to plug into that. I’ve included the configuration I used below:

substitutions:
  name: home-assistant-voice-xxxxxx
  friendly_name: Home Assistant Voice xxxxxx
packages:
  Nabu Casa.Home Assistant Voice PE: 
    url: https://github.com/esphome/home-assistant-voice-pe
    file: home-assistant-voice.yaml
    refresh: 0s
esphome:
  name: ${name}
  name_add_mac_suffix: false
  friendly_name: ${friendly_name}

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password


# =============================================================================
# CONNECTIVITY
# =============================================================================
api:
  on_client_connected:
    - lambda: 'id(intercom).publish_entity_states();'
  encryption:
    key: <your_key_here>

# 2. Add the Intercom-specific external components
external_components:
  - source:
      type: local
      path: intercom-api/esphome_components
    #components: [intercom_api]
    components: [intercom_api, esp_aec]

esp_aec:
  id: aec_processor
  sample_rate: 16000
  filter_length: 4
  mode: voip_low_cost

intercom_api:
  id: intercom
  mode: full
  microphone: i2s_mics
  speaker: media_resampling_speaker
  mic_bits: 32
  dc_offset_removal: true
  aec_id: aec_processor
  ringing_timeout: 30s

  on_incoming_call:
    - light.turn_on:
        brightness: 100%
        id: voice_assistant_leds
        effect: "Rainbow"

  on_outgoing_call:
    - light.turn_on:
        brightness: 100%
        id: voice_assistant_leds
        effect: "Rainbow"
    - if:
        condition:
          lambda: 'return id(intercom).get_current_destination() == "Home Assistant";'
        then:
          - homeassistant.event:
              event: esphome.intercom_call
              data:
                caller: !lambda 'return App.get_friendly_name();'
                destination: "Home Assistant"
                type: "doorbell"

  on_ringing:
    - light.turn_on:
        brightness: 100%
        id: voice_assistant_leds
        effect: "Rainbow"

  on_answered:
    - logger.log: "Call answered"

  on_streaming:
    - light.turn_on:
        id: voice_assistant_leds
        effect: none
        red: 0%
        green: 100%
        blue: 0%

  on_idle:
    - light.turn_off: voice_assistant_leds

  on_hangup:
    - logger.log:
        format: "Hangup: %s"
        args: ['reason.c_str()']

  on_call_failed:
    - logger.log:
        format: "Call failed: %s"
        args: ['reason.c_str()']

# =============================================================================
# BUTTONS
# =============================================================================
button:
  - platform: template
    id: call_button
    name: "Call"
    icon: "mdi:phone"
    on_press:
      - intercom_api.call_toggle:
          id: intercom

  - platform: template
    id: next_contact_button
    name: "Next Contact"
    icon: "mdi:arrow-right"
    on_press:
      - intercom_api.next_contact:
          id: intercom

  - platform: template
    id: prev_contact_button
    name: "Previous Contact"
    icon: "mdi:arrow-left"
    on_press:
      - intercom_api.prev_contact:
          id: intercom

  - platform: template
    id: decline_button
    name: "Decline"
    icon: "mdi:phone-hangup"
    on_press:
      - intercom_api.decline_call:
          id: intercom

  - platform: template
    id: refresh_contacts_button
    name: "Refresh Contacts"
    icon: "mdi:refresh"
    entity_category: config
    on_press:
      - intercom_api.set_contacts:
          id: intercom
          contacts_csv: !lambda 'return id(ha_active_devices).state;'


# =============================================================================
# SWITCHES
# =============================================================================
switch:
  - platform: intercom_api
    intercom_api_id: intercom
    auto_answer:
      id: auto_answer_switch
      name: "Auto Answer"
      restore_mode: RESTORE_DEFAULT_ON
    aec:
      id: aec_switch
      name: "Echo Cancellation"
      restore_mode: RESTORE_DEFAULT_OFF

# =============================================================================
# NUMBERS
# =============================================================================
number:
  - platform: intercom_api
    intercom_api_id: intercom
    speaker_volume:
      id: speaker_volume
      name: "Speaker Volume"
    mic_gain:
      id: mic_gain
      name: "Mic Gain"

# =============================================================================
# TEXT SENSORS
# =============================================================================
text_sensor:
  - platform: homeassistant
    id: ha_active_devices
    entity_id: sensor.intercom_active_devices
    on_value:
      - intercom_api.set_contacts:
          id: intercom
          contacts_csv: !lambda 'return x;'

Hi, thanks for the feedback. I’ll start by saying I’ve never tested Voice PE. However, after doing some research, I understand it has a hardware codec that handles echo cancellation internally, and it also uses a very specific microphone array system. In my opinion, it delivers clean audio. In this case, I think you can safely avoid using the AEC component from my repo. Have you done any tests on this? I’ve never had much time to test the actual modularity of the various components. If you do run some tests, let me know if everything works correctly without AEC. Best regards.

Would it be possible to get this awesome project into the official esphome project?

Can you logically split the relevant improvements into different Pull Requests for esphome?

Hi, I really hope so. It started as an experiment, and it’s become something quite stable and useful, something I didn’t expect at all. I opened this discussion on esphome’s Git. Maybe voting for it or showing interest in it in some way might help: [New Component] Full-Duplex Intercom for ESPHome — ESP32 two-way audio with AEC, Voice Assistant coexistence, and Home Assistant integration · esphome · Discussion #3461 · GitHub

Tell us, what hardware did you test on?


v2.1.0 released — 48kHz audio, MWW detection fixed, reliability improvements
Hey everyone, v2.1.0 is out! :tada:
Release: Release v2.1.0 · n-IA-hane/intercom-api · GitHub
Here’s a quick overview of what’s new:
:loud_sound: 48kHz I2S bus with FIR decimation
The I2S bus now runs at 48kHz (ES8311 native rate), which makes a noticeable difference in TTS and media audio quality. A 31-tap FIR anti-alias filter decimates mic/AEC/VA/intercom paths down to 16kHz internally, while the speaker path stays at 48kHz end-to-end. New output_sample_rate config option — fully backward compatible (omit it and nothing changes).
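
To illustrate the idea of decimating 48kHz down to 16kHz with a 31-tap FIR, here is a rough sketch. Only the tap count and the two sample rates come from the release notes; the Hamming-windowed sinc design, the 7 kHz cutoff, and the floating-point arithmetic are assumptions chosen for the example, not the component's actual coefficients or implementation.

```python
import math


def design_lowpass(num_taps: int, cutoff_hz: float, fs_hz: float) -> list[float]:
    """Hamming-windowed sinc low-pass, normalised to unity DC gain."""
    fc = cutoff_hz / fs_hz  # normalised cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def decimate(samples: list[float], taps: list[float], factor: int) -> list[float]:
    """Low-pass filter and keep every factor-th output sample.

    Only the outputs that survive decimation are computed.
    """
    out = []
    for i in range(len(taps) - 1, len(samples), factor):
        acc = 0.0
        for j, t in enumerate(taps):
            acc += t * samples[i - j]
        out.append(acc)
    return out


if __name__ == "__main__":
    taps = design_lowpass(31, 7000.0, 48000.0)  # cutoff is an assumed value
    # 10 ms of a 1 kHz tone sampled at 48 kHz.
    tone = [math.sin(2 * math.pi * 1000 * n / 48000) for n in range(480)]
    print(len(decimate(tone, taps, 3)))
```

The filter has to run before samples are dropped: content above 8 kHz in the 48kHz stream would otherwise fold back into the 16kHz band that the mic, AEC, and intercom paths use.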
:brain: FreeRTOS task layout overhaul — MWW detection restored
This was a big one. The audio task moved from Core 1 to Core 0 (priority 19), matching the canonical Espressif AEC pattern. MWW inference now schedules freely on Core 1 without AEC interference. Result: 10/10 wake word detection during TTS playback (was ~1/10 before). LVGL rendering also benefits since it’s no longer preempted by AEC every 16ms.
:hammer_and_wrench: Audio reliability fixes
A code audit eliminated several race conditions and stuck-state bugs: proper ERROR message handling (was a no-op leaving ESP stuck in OUTGOING), DC offset IIR state now resets between sessions, TOCTOU fixes on socket/state access, and duplicate STOP removal.
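
Why resetting the DC offset filter state between sessions matters can be shown with a small sketch. The release notes only say that an IIR is used and that its state is now reset; the one-pole DC blocker form, the coefficient 0.995, and the class name below are assumptions for illustration.

```python
class DcBlocker:
    """One-pole DC blocker: y[n] = x[n] - x[n-1] + r * y[n-1] (assumed form)."""

    def __init__(self, r: float = 0.995) -> None:
        self.r = r
        self.reset()

    def reset(self) -> None:
        """Clear the filter memory; intended to be called when a session starts."""
        self.prev_x = 0.0
        self.prev_y = 0.0

    def process(self, x: float) -> float:
        y = x - self.prev_x + self.r * self.prev_y
        self.prev_x = x
        self.prev_y = y
        return y


if __name__ == "__main__":
    stale = DcBlocker()
    for _ in range(5000):
        stale.process(1000.0)  # previous session with a large DC offset
    # Without a reset, the first sample of the next session sees the old input.
    print(stale.process(0.0))

    fresh = DcBlocker()
    for _ in range(5000):
        fresh.process(1000.0)
    fresh.reset()
    print(fresh.process(0.0))
```

In this sketch the stale filter produces a large first sample in the new session because it still remembers the last input of the previous one, which would be heard as a click.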
:broom: Code cleanup & HA integration refactor
Triggers unified (on_incoming_call → on_ringing, on_call_end removed). The intercom_native HA integration got a major refactor — TCP session callbacks extracted into proper IntercomSession methods, dead code removed, frontend debug output cleaned up. Manifest bumped to 2.0.5.
:desktop_computer: Display & UI fixes
SPI clock bumped to 40MHz (halves GC9A01A flush time), instant page transitions, and fixed a bug where stale VA response text would persist on screen.
What’s next (v2.2.0+):
Two new Waveshare boards arrived (ESP32-S3-AUDIO-Board and ESP32-P4-WiFi6-Touch-LCD), both using ES8311 + ES7210. Upcoming work includes ES7210 4-channel ADC support, ESP-AFE integration as an alternative to esp_aec, and ESP32-P4 testing.
As always, feedback, bug reports, and contributions are welcome!


:tv: ESP32-P4 10.1" touch display support
The Waveshare ESP32-P4-WiFi6-Touch-LCD-10.1 now runs the full stack: Voice Assistant, Micro Wake Word, Intercom, and LVGL display. ESP32-P4 RISC-V dual-core, 32MB
PSRAM, 10.1" MIPI DSI capacitive touch, ES8311 + ES7210, WiFi via ESP32-C6 co-processor. Ready-to-flash YAML included.

:art: Split-screen UI
The 10.1" portrait display (800x1280) is divided in two halves:

  • Top: swipeable tileview — weather page (current conditions, MDI icons, 5-day forecast via weather.get_forecasts) and intercom page (contacts, call controls,
    dynamic state groups)
  • Bottom: Voice Assistant with animated avatar (touch to talk), per-state images (listening, thinking, error), and mood-based expressions during replies
    (happy/neutral/angry parsed from LLM response)

https://github.com/n-IA-hane/intercom-api/raw/main/readme-img/p4-va-weather.jpg
https://github.com/n-IA-hane/intercom-api/raw/main/readme-img/p4-va-intercom.jpg

:bell: Ringtone on incoming calls
Devices now play a looping ringtone sound while ringing. Stops automatically on answer, decline, or timeout.

v2.1.1

:microphone: Waveshare ESP32-S3-AUDIO-Board support
New board with ES7210 4-channel ADC alongside the ES8311. Hardware-synced AEC reference via TDM — MIC3 captures the DAC analog output directly from the same I2S
frame. No ring buffer delay tuning needed. Ready-to-flash YAML included.

:gear: TDM AEC reference mode
New use_tdm_reference: true option in i2s_audio_duplex for ES7210 boards. The reference signal is sample-aligned with mic data by design — just set the slot numbers
and it works.
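
The reason a TDM reference needs no delay tuning can be sketched as follows: each TDM frame carries one sample per slot, so the mic sample and the reference sample taken from the same frame belong to the same instant. The slot count and slot indices below are hypothetical, as is the function name; the actual slot assignment depends on the board and the configured slot numbers.

```python
def split_tdm(frames: list[int], num_slots: int, mic_slot: int, ref_slot: int):
    """Split an interleaved TDM sample stream into mic and reference channels.

    frames holds samples in slot order: slot0, slot1, ..., slotN-1, slot0, ...
    """
    if len(frames) % num_slots != 0:
        raise ValueError("stream length must be a whole number of TDM frames")
    mic = frames[mic_slot::num_slots]
    ref = frames[ref_slot::num_slots]
    return mic, ref


if __name__ == "__main__":
    # Two 4-slot frames; assume slot 0 is the voice mic and slot 2 is MIC3 (DAC loopback).
    stream = [10, 11, 12, 13, 20, 21, 22, 23]
    mic, ref = split_tdm(stream, num_slots=4, mic_slot=0, ref_slot=2)
    print(mic, ref)
```

Because both lists are indexed by frame number, mic[i] and ref[i] are already aligned when they are handed to the echo canceller.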

:wrench: Reliability fixes
Ring buffer race conditions (atomic request flags), AEC buffer allocation checks, task cleanup fixes, I2S persistent error recovery, and ~25-32KB RAM savings by
skipping unused speaker ref buffer in stereo/TDM modes.

:white_check_mark: Tested hardware

  • :crystal_ball: Xiaozhi Ball V3 — ES8311 single bus, AEC via digital feedback (stereo)
  • :speaker: Waveshare ESP32-S3-AUDIO — ES8311 + ES7210 TDM, AEC via hardware analog (MIC3)
  • :desktop_computer: Waveshare ESP32-P4 10.1" Touch LCD — ES8311 + ES7210, LVGL split-screen UI
  • :radio: ESP32-S3 Mini — SPH0645 + MAX98357A dual bus, no display

All configs include Voice Assistant + Micro Wake Word + Intercom with ready-to-flash YAMLs.

Feedback and bug reports welcome on GitHub.


I like your new updates.
These last few days I didn't have much time to work on it because I'm stuck in other projects, but I will start on the one using your intercom soon.

For that I have one question.
What kind of software/add-on do you use for the visual stuff on the screen (LVGL and so on)? Is it also Claude? Which model?
I was struggling hard with my JC4880P443C-I-W-Y to get anything useful on the screen, so help from an editor with AI integrated would be nice :smiley:

Hi, nice to hear from you again. I use Claude Code on Arch Linux and recently also OpenClaw with GPT 5.3-Codex. I often have them work together as antagonists, so they correct each other and the result is better. I’m not new to programming, I regularly audit the generated code and always find something to fix. I’m learning LVGL on my own through trial and error since AIs don’t know it well, and in the meantime I’m also getting into hardware analysis with JTAG. Next I’d like to try integrating the AFE framework for beamforming and noise suppression, but it’s a tough challenge given the limited resources of ESP chips. If I manage it, audio support for ESPHome would really be in great shape.

Ah, I understand, so more of a professional background :).
I know it's off-topic, but OpenClaw and Claude Code working together in the same project sounds very interesting :slight_smile: Do you have it automated, or do you copy from one to the other and ask for feedback?
Looking forward to testing it :smiley:

Easier than you might think: OpenClaw can be called interactively. Type, reply, or use the command openclaw -id conversation (I don’t remember exactly, I don’t have a PC handy now). It replies and exits. Both operate in the same folder. If you use an LLM locally from a shell, it’s not confined to an editor or IDE, so it potentially has access to the entire system it runs on. Both have access to the same folder, so they can examine the source code independently; one doesn’t have to pass it to the other. One LLM calls the other via the command line, passing it the tasks to do and interpreting the responses. Basically, on Linux you really do have one or more entities that have total omnipotence over the entire system. If you want, you can ask them to “Update the system with pacman (Arch)” or (if you have installed ssh keys between your systems) “Log in via ssh to the lxc I use for speech recognition and update using the update command.” It’s interesting to see the tricks they use to move around. Take that last hypothetical command. Maybe I didn’t specify the IP address of the target machine, but it knows roughly what it has to do. It will probably start an mDNS scan, or an IP scan, analyze the hostnames it finds, maybe see a hostname “whisper-cpp” and guess that it is the one in charge of speech recognition. It will log in via ssh, update the system, and receive the command response, all with a single command. Typically, they chain together multiple commands to do something all at once. Things have improved significantly now. In the early days, they were a disaster. A “test this thing for me in Home Assistant, use SSH and test the web part with this Long-Lived Token” and you’d end up with a completely broken Home Assistant instance, hundreds of conflicting dummy automations, modified components, etc.
If an LLM hallucinates in chat, no big deal, it’ll give you wrong answers; if it happens on an operating system, you’ve earned yourself a reformat. I don’t know if I’m making myself clear :blush:

Yes, I fully understand what you are telling me.
Thank you very much.
Until now I was only working with AIs inside the IDE, with access to specific folders only. I never gave them system access.
But it’s an interesting approach. Looking deeper and exploring those possibilities just moved onto my to-do list.

@meconiotech I’ve been using the esp_aec and i2s_audio_duplex components on more and more devices and it’s been working perfectly. Any thoughts to pushing at least those two components upstream to esphome?


Hi, nice to hear from you again. First of all, I’d like to thank you personally for your feedback. People usually write to me to get support or complain about something; I have no idea how well it actually works for others or how. I know I’m having a lot of fun working on it; I have new ideas every day. For example, I’ll soon release a trainer for Micro Wake Words so everyone can create their own MWW. I’d also like to achieve another goal: to have the voice assistant call another intercom, like “Alexa, call the kitchen,” but I’ve been short on free time lately; what I do have is spent improving and updating i2s duplex audio. As always, if anyone wants to help me with this, they’d be more than welcome. Regarding the PR, I really hope to succeed in the near future. I’ll research how to do it thoroughly and give it a try.