I think the ESP32 Voice Assistant topic is worth revisiting now that the ESPHome Speaker Mixer has become the standard approach.
Following tutorials from @tpage, @lmatter and others, I built my own ESP32-S3-based Voice Assistant using the Speaker Mixer, a microphone, wake word detection, and the Voice Assistant component.
I repurposed a Xiaomi Gateway that was of no use with Home Assistant.
The issue I'm facing is that the ESP32-S3 SuperMini only has a single I2S controller available for my setup. The microphone is always active by default, continuously capturing audio for wake word detection. Whenever audio playback starts (WAV files, announcements, TTS responses, or media playback), the microphone should stop and release the I2S bus so the speaker can take ownership of it.
The problem is that I haven't found a reliable way to receive a "pre-playback" event. Triggers such as on_play, on_turn_on, or on_state seem to occur after the speaker pipeline has already started allocating resources, which is too late. At that point, the I2S bus is still owned by the microphone and I get allocation/collision errors.
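One way to check this ordering on a given build is to log the state of both endpoints from inside each trigger. The fragment below is a minimal diagnostic sketch, not a fix. It assumes the IDs from the configuration further down (`va_mic`, `va_speaker_hw`, `va_mediaplayer`) and uses the `is_running()` accessors that the ESPHome microphone and speaker base classes expose to lambdas. The triggers are meant to be merged into the existing `media_player` entry, and the `[DIAG]` prefix is just an illustrative tag.

```yaml
media_player:
  - platform: speaker
    id: va_mediaplayer
    # ... pipelines unchanged from the full configuration ...
    on_announcement:
      - logger.log:
          format: "[DIAG] on_announcement: mic running=%d, speaker running=%d"
          args: ['id(va_mic).is_running()', 'id(va_speaker_hw).is_running()']
    on_play:
      - logger.log:
          format: "[DIAG] on_play: mic running=%d, speaker running=%d"
          args: ['id(va_mic).is_running()', 'id(va_speaker_hw).is_running()']
    on_state:
      - logger.log:
          format: "[DIAG] on_state: mic running=%d, speaker running=%d"
          args: ['id(va_mic).is_running()', 'id(va_speaker_hw).is_running()']
```

Comparing these lines against the I2S allocation errors in the same log should show which trigger, if any, fires before the speaker tries to claim the bus.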
My ideal flow would be:
- Stop wake word detection.
- Stop Voice Assistant if running.
- Stop microphone capture.
- Wait for the I2S RX channel to be released.
- Start speaker playback.
- When playback finishes, stop the speaker.
- Re-enable microphone capture.
- Restart wake word detection.
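The steps above can be sketched as a single script that performs the handoff explicitly instead of relying on media player triggers. This is a rough sketch under assumptions, not a tested solution: `play_with_i2s_handoff` is a hypothetical script name, the component IDs come from the configuration below, the timeouts are arbitrary, and it assumes that `microphone.is_capturing` turning false and `speaker.is_stopped` turning true coincide with the I2S driver actually releasing the bus, which is not confirmed.

```yaml
script:
  - id: play_with_i2s_handoff  # hypothetical name
    mode: single
    parameters:
      url: string
    then:
      # Steps 1-3: stop wake word, Voice Assistant and microphone capture.
      - micro_wake_word.stop:
      - voice_assistant.stop:
      - microphone.stop_capture: va_mic
      # Step 4: wait until the microphone reports that capture has stopped.
      - wait_until:
          condition:
            not:
              microphone.is_capturing: va_mic
          timeout: 2s
      # Step 5: start playback through the announcement pipeline.
      - media_player.play_media:
          id: va_mediaplayer
          media_url: !lambda return url;
          announcement: true
      # Step 6: wait for playback to begin, then to finish, then stop the speaker.
      - wait_until:
          condition:
            not:
              media_player.is_idle: va_mediaplayer
          timeout: 3s
      - wait_until:
          condition:
            media_player.is_idle: va_mediaplayer
          timeout: 60s
      - speaker.stop: va_speaker_hw
      - wait_until:
          condition:
            speaker.is_stopped: va_speaker_hw
          timeout: 2s
      # Steps 7-8: restore microphone capture and wake word detection.
      - microphone.capture: va_mic
      - micro_wake_word.start:
```

It would be called with `script.execute`, passing `id: play_with_i2s_handoff` and a `url` value. Note that this only covers playback the device itself initiates; playback started from Home Assistant would still reach the media player directly, which is exactly the case where a pre-playback hook is missing.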
Has anyone successfully implemented this with a shared I2S bus between microphone and speaker on an ESP32-S3 SuperMini or similar?
Is there an existing pattern, component hook, or recommended architecture for handling I2S ownership transitions before audio playback begins?
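For comparison, here is a sketch of the layout that would avoid ownership transitions altogether by giving the microphone and the speaker separate I2S buses (the ESP32-S3 chip has two I2S peripherals). It assumes two spare GPIOs for a second clock pair: GPIO4 and GPIO5, as well as the bus IDs `i2s_in` and `i2s_out`, are hypothetical placeholders. It would also require rewiring the MAX98357A clock lines, so it does not apply if the wiring has to stay shared.

```yaml
i2s_audio:
  - id: i2s_in             # microphone bus, existing clock pins
    i2s_lrclk_pin: GPIO8
    i2s_bclk_pin: GPIO9
  - id: i2s_out            # speaker bus, hypothetical spare pins
    i2s_lrclk_pin: GPIO4
    i2s_bclk_pin: GPIO5

microphone:
  - platform: i2s_audio
    id: va_mic
    i2s_audio_id: i2s_in
    adc_type: external
    i2s_din_pin: GPIO7
    channel: left
    pdm: false

speaker:
  - platform: i2s_audio
    id: va_speaker_hw
    i2s_audio_id: i2s_out
    dac_type: external
    i2s_dout_pin: GPIO10
    channel: mono
```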
Hardware
| Component | Model |
|---|---|
| MCU | ESP32-S3 SuperMini |
| Microphone | INMP441 I2S MEMS Microphone |
| Amplifier | MAX98357A I2S DAC/Amplifier |
| Speaker | 4 Ω / 3 W Speaker |
| RGB LED | RGB LED Strip |
| Button | GPIO Push Button |
Pinout
| Function | GPIO |
|---|---|
| I2S LRCLK / WS | GPIO8 |
| I2S BCLK / SCK | GPIO9 |
| INMP441 SD (Data Out) | GPIO7 |
| MAX98357A DIN (Data In) | GPIO10 |
| Push Button | GPIO6 |
| RGB LED Red | GPIO11 |
| RGB LED Green | GPIO12 |
| RGB LED Blue | GPIO13 |
Audio Topology
              ESP32-S3 SuperMini
            Shared I2S Control Bus
              GPIO8 (LRCLK / WS)
              GPIO9 (BCLK / SCK)
                      │
        ┌─────────────┴─────────────┐
        │                           │
        ▼                           ▼
INMP441 Microphone         MAX98357A Amplifier
    GPIO7 (SD)                GPIO10 (DIN)
    I2S RX Data               I2S TX Data
Software Stack
- ESPHome 2026.5
- Home Assistant Voice Assistant
- Speaker Mixer
- micro_wake_word
- Shared I2S bus for microphone and speaker
- Wake word always active while idle
- Speaker used for:
  - TTS responses
  - Announcement pipeline
  - WAV playback
  - Media playback
I'm including my ESPHome configuration below in case it helps others reproduce the issue.
substitutions:
  device_name: hall-speaker
  friendly_name: Hall Speaker

esphome:
  name: ${device_name}
  friendly_name: ${friendly_name}
  on_boot:
    priority: 600.0
    then:
      - speaker.stop: va_speaker_hw
      - microphone.stop_capture: va_mic

esp32:
  board: esp32-s3-devkitc-1
  framework:
    type: esp-idf
    sdkconfig_options:
      CONFIG_ESP32S3_DEFAULT_CPU_FREQ_240: "y"
      CONFIG_ESP32S3_DATA_CACHE_64KB: "y"
      CONFIG_ESP32S3_DATA_CACHE_LINE_64B: "y"

# Enable logging
logger:

# Enable Home Assistant API
api:
  encryption:
    key: ""
  on_client_connected:
    then:
      - light.turn_on:
          id: light1
          brightness: 25%
          red: 0%
          green: 50.9%
          blue: 98.8%
          effect: "Soft Breath"
      - delay: 800ms
      - light.turn_off: light1

ota:
  - platform: esphome
    password: ""

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
  power_save_mode: none
  manual_ip:
    static_ip: 192.168.1.51
    gateway: 192.168.1.1
    subnet: 255.255.255.0
    dns1: 1.1.1.1
    dns2: 8.8.8.8
  # Enable fallback hotspot (captive portal) in case wifi connection fails
  ap:
    ssid: "${friendly_name} Hotspot"
    password: ""

network:
  enable_high_performance: false  # Avoid aggressive high-performance networking to save RAM

captive_portal:

web_server:

binary_sensor:
  - platform: gpio
    name: "Button 1"
    id: button_1
    pin:
      number: GPIO6
      mode:
        input: true
        pullup: true
      inverted: true
    on_press:
      - if:
          condition:
            switch.is_on: mute_mic_sw
          then:
            - switch.turn_off: mute_mic_sw
          else:
            - switch.turn_on: mute_mic_sw

output:
  - platform: ledc
    id: output_led_red
    pin: GPIO11
    frequency: 1220 Hz
  - platform: ledc
    id: output_led_green
    pin: GPIO12
    frequency: 1220 Hz
  - platform: ledc
    id: output_led_blue
    pin: GPIO13
    frequency: 1220 Hz

light:
  - platform: rgb
    id: light1
    name: "LED bar"
    red: output_led_red
    green: output_led_green
    blue: output_led_blue
    default_transition_length: 500ms
    gamma_correct: 0
    restore_mode: ALWAYS_OFF
    effects:
      - pulse:
          name: "Soft Breath"
          min_brightness: 0%
          max_brightness: 25%
          transition_length:
            on_length: 1s
            off_length: 1500ms
          update_interval: 2s

button:
  - platform: restart
    name: "Restart ESP32 Media Player"
  - platform: template
    name: "Test Speaker WAV"
    on_press:
      # Release the microphone side first, then play through the announcement pipeline.
      - script.execute: audio_enter_speaker_mode
      - script.wait: audio_enter_speaker_mode
      - media_player.play_media:
          id: va_mediaplayer
          media_url: "http://192.168.1.101:8123/local/beep.wav"
          announcement: true

switch:
  - platform: template
    id: mute_mic_sw
    name: "Mute microphone"
    optimistic: true
    on_turn_on:
      - script.execute: mute_mic
    on_turn_off:
      - script.execute: unmute_mic

script:
  - id: audio_enter_speaker_mode
    mode: single
    then:
      - logger.log: "[AUDIO] Entering speaker mode: stopping wake word, VA and microphone"
      - micro_wake_word.stop:
      - voice_assistant.stop:
      - delay: 300ms
      - microphone.stop_capture: va_mic
      - delay: 500ms  # Give the I2S RX side time to release before the speaker starts using TX.

  - id: audio_leave_speaker_mode
    mode: single
    then:
      - logger.log: "[AUDIO] Leaving speaker mode: stopping hardware speaker and restoring microphone if allowed"
      - speaker.stop: va_speaker_hw  # Stop the real I2S hardware speaker, not the mixer.
      - delay: 700ms  # Give the I2S TX side time to release before restoring RX/microphone.
      - if:
          condition:
            and:
              - switch.is_off: mute_mic_sw
              - not:
                  voice_assistant.is_running:
              - not:
                  media_player.is_playing: va_mediaplayer
          then:
            - logger.log: "[AUDIO] Restoring microphone capture + wake word"
            - microphone.capture: va_mic
            - delay: 250ms
            - micro_wake_word.start:
          else:
            - logger.log: "[AUDIO] Microphone restore skipped: muted, VA running or media still playing"

  - id: mute_mic
    mode: single
    then:
      - logger.log: "[MIC] Muting microphone"
      - micro_wake_word.stop:
      - voice_assistant.stop:
      - delay: 300ms
      - microphone.stop_capture: va_mic
      - light.turn_on:
          id: light1
          red: 100%
          green: 30%
          blue: 18%
          brightness: 20%
          transition_length: 800ms

  - id: unmute_mic
    mode: single
    then:
      - logger.log: "[MIC] Unmuting microphone"
      - light.turn_off:
          id: light1
          transition_length: 300ms
      - script.execute: audio_leave_speaker_mode

i2s_audio:
  - id: i2s
    i2s_lrclk_pin: GPIO8
    i2s_bclk_pin: GPIO9

microphone:
  - platform: i2s_audio
    id: va_mic
    i2s_audio_id: i2s
    adc_type: external
    i2s_din_pin: GPIO7
    channel: left
    pdm: false

speaker:
  - platform: i2s_audio
    id: va_speaker_hw
    i2s_audio_id: i2s
    dac_type: external
    i2s_dout_pin: GPIO10
    channel: mono
    bits_per_sample: 16bit
    sample_rate: 16000
  - platform: mixer
    id: va_speaker
    output_speaker: va_speaker_hw
    source_speakers:
      - id: va_speaker_announcement
      - id: va_speaker_media

media_player:
  - platform: speaker
    id: va_mediaplayer
    name: "Corridor Speaker"
    buffer_size: 50000
    announcement_pipeline:
      speaker: va_speaker_announcement
      format: WAV  # Ask Home Assistant to transcode to a low-cost WAV stream
      #format: FLAC
      num_channels: 1
      sample_rate: 16000
    media_pipeline:
      speaker: va_speaker_media
      format: WAV
      num_channels: 1
      sample_rate: 16000
    on_turn_on:
      then:
        - logger.log: "[MEDIA] Media Player turned on"
        - script.execute: audio_enter_speaker_mode
        - script.wait: audio_enter_speaker_mode
    on_play:
      then:
        - logger.log: "Media playback started."
    on_idle:
      then:
        - logger.log: "[MEDIA] Playback finished"
        - delay: 5s
        - if:
            condition:
              media_player.is_idle: va_mediaplayer
            then:
              - media_player.turn_off: va_mediaplayer
            else:
              - logger.log: "[MEDIA] Media Player not idle, not turning off"
    on_turn_off:
      then:
        - logger.log: "[MEDIA] Media Player turned off"
        - script.execute: audio_leave_speaker_mode

micro_wake_word:
  id: my_micro_wake_word
  vad:
    model: github://esphome/micro-wake-word-models/models/v2/vad.json
  models:
    - model: github://esphome/micro-wake-word-models/models/v2/hey_jarvis.json
  on_wake_word_detected:
    - micro_wake_word.stop:
    - voice_assistant.start:
        wake_word: !lambda return wake_word;
    - light.turn_on:
        id: light1
        red: 0%
        green: 0%
        blue: 100%
        brightness: 30%

voice_assistant:
  id: va
  microphone: va_mic
  media_player: va_mediaplayer
  noise_suppression_level: 2
  volume_multiplier: 3.0
  auto_gain: 31dBFS
  micro_wake_word: my_micro_wake_word
  use_wake_word: true
  on_start:
    - logger.log: "[VA] Starting"
    - micro_wake_word.stop:
    - light.turn_on:
        id: light1
        red: 0%
        green: 100%
        blue: 0%
        brightness: 50%
  on_listening:
    - logger.log: "[VA] Listening"
    - light.turn_on:
        id: light1
        red: 0%
        green: 0%
        blue: 100%
        brightness: 100%
        effect: "Soft Breath"
  on_stt_end:
    then:
      - logger.log: "[VA] STT END"
      - light.turn_off: light1
      - script.execute: audio_enter_speaker_mode
  on_error:
    - logger.log: "[VA] ERROR"
    - delay: 1s
    - script.execute: audio_leave_speaker_mode
  on_end:
    then:
      - logger.log: "[VA] END"
      - light.turn_off: light1
      - wait_until:
          not:
            voice_assistant.is_running:
      - delay: 500ms
      - script.execute: audio_leave_speaker_mode
