🔔 ESPHome Full-Duplex Audio Intercom - Because I Was Bored on Vacation

Hi Jakub, thanks for the feedback!

You’re right. The filter_length should only need to cover the actual acoustic echo tail (speaker to mic physical path), not compensate for software buffering
alignment.

The ring buffer reference mode uses a fixed pre-fill delay (currently hardcoded at 80ms) to align speaker samples with mic data. If that delay doesn’t match your
hardware’s actual DMA latency, the AEC filter has to search a wider window, requiring higher filter_length (and more CPU).

We’ve just added a configurable delay on dev: aec_reference_delay_ms so you can tune the pre-fill for your hardware. If you find the right value for your setup,
filter_length: 4 should work.

intercom_api:
  id: intercom
  mode: full
  microphone: mic_component
  speaker: spk_component
  dc_offset_removal: true
  aec_id: aec_processor
  aec_reference_delay_ms: 80  # Tune for your hardware (10-200ms). Try lower values with filter_length: 4.
  ringing_timeout: 30s

The default is 80ms, which works for most setups with separate I2S buses. Try lowering it (40-60ms) and see if filter_length: 4 gives you good cancellation. The right
value depends on your I2S DMA buffer configuration and codec latency.
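
As a starting point for that tuning, a trimmed fragment might look like the sketch below. The value 50 is only an illustrative midpoint of the 40-60ms range, not a measured figure, and the remaining keys are assumed to stay as in the config above. filter_length itself is set on the AEC component (aec_processor here), whose config is not shown in this thread.

```yaml
intercom_api:
  id: intercom
  aec_id: aec_processor
  # Illustrative value only: step through 40-60ms and compare cancellation
  # quality with filter_length: 4 set on the AEC component.
  aec_reference_delay_ms: 50
  # ... other keys (mode, microphone, speaker, etc.) unchanged
```

If cancellation gets worse at every lower value, the original 80ms is probably closer to your hardware's real latency.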

We’re also exploring a single-bus duplex approach (i2s_audio_duplex) where mic and speaker share the same I2S bus (same BCLK/LRCLK, separate DIN/DOUT), giving a
sample-aligned reference without the ring buffer delay. It is currently tested with codec hardware (ES8311/ES7210), and we’re working on a no-codec variant.

Can you share your hardware setup? (which mic, speaker, codec if any, single or dual I2S bus?) That would help us suggest the best configuration.

Thank you for the fast response.
Actually, I’m on the dev branch already. I did some experiments with aec_reference_delay_ms; maybe I missed the sweet spot. I will do some more, or maybe try to measure the delay somehow.
I’m mostly interested in the i2s_audio_duplex and aec components. My goal is to improve voice recognition for a custom ESPHome speaker with voice assistant while music is playing. I’m using an INMP441 and a MAX98357 on a single I2S bus. To validate the improvement, I’m streaming pre-AEC and post-AEC microphone data to a PC. As long as the mic is not extremely overdriven, your code does a really good job of cancelling speaker noise. I will get back to filter_length: 4 and experiment further. Thanks for clearing up my misunderstanding of how the speaker reference delay and filter length are related.

Is there a way to use this with just an I2S mic (ICS-43434) and stream the audio over TCP? Maybe even the raw data or an RTSP stream? At the moment I push the raw mic data over UDP to a server running go2rtc to make an RTSP stream, but if this could replace part or all of that boilerplate, that would be awesome!

udp:
  - id: nuc_udp
    addresses: ["192.168.178.12"]
    port: 4800

microphone:
  - platform: i2s_audio
    id: board_mic
    i2s_audio_id: i2s_in
    i2s_din_pin: GPIO2
    adc_type: external
    pdm: false
    sample_rate: 48000
    bits_per_sample: 16bit
    channel: left
    on_data:
      - udp.write:
          id: nuc_udp
          data: !lambda 'return x;'
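
For context, the server side of a setup like this could look roughly like the go2rtc fragment below. This is a hypothetical sketch, not the config actually in use: the stream name esp_mic and the ffmpeg flags are assumptions, chosen to match 16-bit mono little-endian PCM at 48 kHz arriving on UDP port 4800.

```yaml
# go2rtc.yaml (server side) - hypothetical sketch, not taken from this thread
streams:
  # "esp_mic" is an assumed stream name
  esp_mic:
    # Read raw 16-bit little-endian mono PCM at 48 kHz from UDP port 4800,
    # encode it to Opus, and publish it into go2rtc via the {output} placeholder.
    - exec:ffmpeg -f s16le -ar 48000 -ac 1 -i udp://0.0.0.0:4800 -c:a libopus -rtsp_transport tcp -f rtsp {output}
```

The -ar and -ac values have to match sample_rate and the channel count in the microphone config, since raw PCM carries no header describing its format.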