(voice assistant, micro-wake-word, speech-to-phrase)
I tried to build a voice assistant based on "Voice assistant based on Voice Assistant PE" (https://github.com/KristopherMackowiak/ha_voice_assistant/), using one speaker instead of two, but so far with only moderate success. Speech recognition works, but the whole system crashes very frequently and no sound comes from the speaker.
I have tried many variations of the speaker:, media_player: and micro_wake_word: settings: bits_per_sample at 32bit or 16bit, sample_rate at 16000 or 48000. Sometimes the start sound is audible, but currently mostly not. Occasionally a response is played, but only once, and then there is silence until the next crash. Playing sounds or radio from the media player does not work at all. I suspect the problems are triggered by audio playback, because everything works up until the first sound output. What am I doing wrong?
.
.
.
i2s_audio:
  - id: sz_i2s_output
    # i2s_output data pin is GPIO10
    i2s_lrclk_pin:
      number: GPIO7 # LRC on MAX98357A
    i2s_bclk_pin:
      number: GPIO8 # BCLK on MAX98357A
  - id: sz_i2s_input
    # data line is GPIO15
    i2s_lrclk_pin:
      number: GPIO14 # WS on INMP441 microphone
    i2s_bclk_pin:
      number: GPIO13 # SCK on INMP441 microphone
microphone:
  - platform: i2s_audio
    id: sz_i2s_mics
    i2s_din_pin: GPIO15
    adc_type: external
    i2s_audio_id: sz_i2s_input
    channel: left # L/R pin of INMP441 => GND
speaker:
  # Hardware speaker output
  - platform: i2s_audio
    id: sz_i2s_audio_speaker
    sample_rate: 48000 # also tried 16000
    i2s_mode: primary # Voice PE uses secondary
    i2s_dout_pin: GPIO10
    bits_per_sample: 32bit # also tried 16bit
    i2s_audio_id: sz_i2s_output
    dac_type: external
    channel: mono # stereo
    timeout: never
    buffer_duration: 100ms
  # Virtual speakers to combine the announcement and media streams together into one output
  - platform: mixer
    id: sz_mixing_speaker
    output_speaker: sz_i2s_audio_speaker
    num_channels: 1 # 2
    source_speakers:
      - id: sz_announcement_mixing_input
        timeout: never
      - id: sz_media_mixing_input
        timeout: never
  # Virtual speakers to resample each pipeline's audio, if necessary, as the mixer speaker requires the same sample rate
  - platform: resampler
    id: sz_announcement_resampling_speaker
    output_speaker: sz_announcement_mixing_input
    sample_rate: 48000 # also tried 16000
    bits_per_sample: 16
  - platform: resampler
    id: sz_media_resampling_speaker
    output_speaker: sz_media_mixing_input
    sample_rate: 48000 # also tried 16000 and 8000
    bits_per_sample: 16
media_player:
  - platform: speaker
    id: external_media_player
    name: Schlafzimmer Media Player
    internal: false
    volume_increment: 0.05
    volume_min: 0.4
    volume_max: 1.0 # 0.85
    # buffer_size: 40000 # Must be between 4000 and 4000000. Defaults to 100000
    codec_support_enabled: true # set to false and specify a format to save resources
    announcement_pipeline:
      speaker: sz_announcement_resampling_speaker
      # format: FLAC # FLAC is the least processor intensive codec
      num_channels: 1 # Stereo audio is unnecessary for announcements
      sample_rate: 48000 # also tried 16000
    media_pipeline:
      speaker: sz_media_resampling_speaker
      # format: FLAC # FLAC is the least processor intensive codec
      num_channels: 1 # 2
      sample_rate: 48000 # also tried 16000
    on_mute:
      - script.execute: control_leds
    on_unmute:
      - script.execute: control_leds
    on_volume:
      - script.execute: control_leds
    on_announcement:
.
.
.
micro_wake_word:
  id: schlafzimmer_mww
  microphone:
    microphone: sz_i2s_mics
  stop_after_detection: false
  models:
    - model: Schlafzimmer_Resources/MWW/okay_nabu.json
      id: okay_nabu
.
.
.
I’m currently using a lab power supply. I’ve already increased it to a maximum of 1.2A. That should be enough for a voice assistant with a 2W speaker and 12 LEDs, right?
When I switch on the power, a short sound comes from the speaker, and after that there is silence. Nothing can be played via the media player, and no voice assistant responses are played…
What settings need to be changed for sound to play through the speaker? Sometimes I get one or two sounds from the Assistant, but mostly not. Occasionally I can play an MP3 from the media player a few times, but usually not. I can’t find any pattern or connection between the two.
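One way to look for a pattern is to temporarily take the mixer and resampler speakers out of the chain and drive the MAX98357A directly, so that only the I2S output path is under test. The following is a minimal sketch, not a known-good configuration: it reuses the output pins from the config above, the `test_*` ids and the player name are made up for illustration, and 16 kHz / 16-bit mono is an assumed starting point chosen because it matches typical TTS output.

```yaml
# Stripped-down output test: one I2S bus, one speaker, one media player.
# Pins are taken from the config above; all test_* ids are hypothetical.
i2s_audio:
  - id: test_i2s_output
    i2s_lrclk_pin: GPIO7   # LRC on MAX98357A
    i2s_bclk_pin: GPIO8    # BCLK on MAX98357A

speaker:
  - platform: i2s_audio
    id: test_speaker
    i2s_audio_id: test_i2s_output
    i2s_dout_pin: GPIO10   # DIN on MAX98357A
    dac_type: external
    channel: mono
    sample_rate: 16000     # assumed starting point, not a verified fix
    bits_per_sample: 16bit

media_player:
  - platform: speaker
    id: test_media_player
    name: Test Media Player
    # No mixer or resampler: the pipeline feeds the hardware speaker directly,
    # so the requested format has to match the speaker settings above.
    announcement_pipeline:
      speaker: test_speaker
      format: FLAC
      num_channels: 1
      sample_rate: 16000
```

If this reduced setup still goes silent or crashes on the first announcement, the cause is more likely in wiring, power, or the amplifier board than in the mixer and resampler settings.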
[14:18:19.928][I][app:190]: ESPHome version 2025.11.2 compiled on Nov 30 2025, 14:10:36
[14:18:19.928][C][logger:261]: Logger:
[14:18:19.928][C][logger:261]: Max Level: VERBOSE
[14:18:19.928][C][logger:261]: Initial Level: VERBOSE
[14:18:19.928][C][logger:267]: Log Baud Rate: 115200
[14:18:19.928][C][logger:267]: Hardware UART: UART0
[14:18:19.929][C][logger:274]: Task Log Buffer Size: 768
[14:18:19.946][C][logger:280]: Level for 'api': VERBOSE
[14:18:19.967][C][template.switch:092]: Template Switch 'Use Marvin wake word'
[14:18:19.967][C][template.switch:092]: Restore Mode: always OFF
[14:18:19.968][C][template.switch:056]: Optimistic: YES
[14:18:19.985][C][psram:016]: PSRAM:
[14:18:19.988][C][psram:019]: Available: YES
[14:18:19.988][C][psram:021]: Size: 8192 KB
[14:18:20.008][C][i2s_audio.microphone:079]: Microphone:
[14:18:20.008][C][i2s_audio.microphone:079]: Pin: 15
[14:18:20.008][C][i2s_audio.microphone:079]: PDM: NO
[14:18:20.008][C][i2s_audio.microphone:079]: DC offset correction: NO
[14:18:20.022][C][status:018]: Status Binary Sensor 'API Connection'
[14:18:20.023][C][status:021]: Device Class: 'connectivity'
[14:18:20.040][C][i2s_audio.speaker:074]: Speaker:
[14:18:20.040][C][i2s_audio.speaker:074]: Pin: 7
[14:18:20.040][C][i2s_audio.speaker:074]: Buffer duration: 500
[14:18:20.040][C][i2s_audio.speaker:080]: Timeout: 500 ms
[14:18:20.040][C][i2s_audio.speaker:088]: Communication format: std
[14:18:20.057][C][captive_portal:122]: Captive Portal:
[14:18:20.074][C][wifi:1062]: WiFi:
[14:18:20.074][C][wifi:1062]: Connected: YES
[14:18:20.074][C][wifi:827]: Local MAC: 8e:C5:4E:C3:1C:4D
[14:18:20.076][C][wifi:834]: IP Address: 192.168.178.76
[14:18:20.078][C][wifi:838]: SSID: 'XXXXX-FritzBox'[redacted]
[14:18:20.078][C][wifi:838]: BSSID: 3C:81:CB:05:91:AA[redacted]
[14:18:20.078][C][wifi:838]: Hostname: 'marvin'
[14:18:20.078][C][wifi:838]: Signal strength: -52 dB ▂▄▆█
[14:18:20.078][C][wifi:838]: Channel: 6
[14:18:20.078][C][wifi:838]: Subnet: 255.255.255.0
[14:18:20.078][C][wifi:838]: Gateway: 192.168.178.1
[14:18:20.078][C][wifi:838]: DNS1: 192.168.178.1
[14:18:20.078][C][wifi:838]: DNS2: 0.0.0.0
[14:18:20.099][C][esphome.ota:093]: Over-The-Air updates:
[14:18:20.099][C][esphome.ota:093]: Address: marvin.local:3232
[14:18:20.099][C][esphome.ota:093]: Version: 2
[14:18:20.107][C][esphome.ota:100]: Password configured
[14:18:20.111][C][safe_mode:018]: Safe Mode:
[14:18:20.111][C][safe_mode:018]: Successful after: 60s
[14:18:20.111][C][safe_mode:018]: Invoke after: 10 attempts
[14:18:20.111][C][safe_mode:018]: Duration: 300s
[14:18:20.123][C][web_server.ota:241]: Web Server OTA
[14:18:20.130][C][api:223]: Server:
[14:18:20.130][C][api:223]: Address: marvin.local:6053
[14:18:20.130][C][api:223]: Listen backlog: 4
[14:18:20.130][C][api:223]: Max connections: 8
[14:18:20.130][C][api:230]: Noise encryption: YES
[14:18:20.139][C][mdns:177]: mDNS:
[14:18:20.139][C][mdns:177]: Hostname: marvin
[14:18:20.141][V][mdns:182]: Services:
[14:18:20.143][V][mdns:184]: - _esphomelib, _tcp, 6053
[14:18:20.152][V][mdns:187]: TXT: friendly_name = Marvin
[14:18:20.153][V][mdns:187]: TXT: version = 2025.11.2
[14:18:20.153][V][mdns:187]: TXT: mac = 81b56ec31e3d
[14:18:20.161][V][mdns:187]: TXT: platform = ESP32
[14:18:20.161][V][mdns:187]: TXT: board = esp32-s3-devkitc-1
[14:18:20.171][V][mdns:187]: TXT: network = wifi
[14:18:20.180][V][mdns:187]: TXT: api_encryption = Noise_NNpsk0_25519_ChaChaPoly_SHA256
[14:18:22.506][D][voice_assistant:624]: Event Type: 10
[14:18:22.507][D][voice_assistant:641]: Wake word detected
[14:18:22.510][D][main:578]: =>> on_wake_word_detected: Voice assistant has detected the wakeword!!
[14:18:22.510][D][voice_assistant:624]: Event Type: 3
[14:18:22.510][D][voice_assistant:646]: STT started
[14:18:22.523][D][main:568]: =>> on_listening: Voice assistant is listening...
[14:18:22.789][D][voice_assistant:624]: Event Type: 11
[14:18:22.791][D][voice_assistant:827]: Starting STT by VAD
[14:18:25.421][D][voice_assistant:624]: Event Type: 12
[14:18:25.421][D][voice_assistant:831]: STT by VAD end
[14:18:25.421][D][voice_assistant:478]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[14:18:25.421][D][voice_assistant:485]: Desired state set to AWAITING_RESPONSE
[14:18:25.425][D][voice_assistant:478]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[14:18:25.439][D][voice_assistant:478]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[14:18:25.440][V][i2s_audio.microphone:486]: Task finished, freeing resources and uninstalling driver
[14:18:25.876][D][voice_assistant:624]: Event Type: 4
[14:18:25.876][D][voice_assistant:663]: Speech recognised as: "Stehlampe ausschalten"
[14:18:25.878][D][voice_assistant:624]: Event Type: 5
[14:18:25.878][D][voice_assistant:668]: Intent started
[14:18:25.886][D][voice_assistant:624]: Event Type: 6
[14:18:25.892][D][voice_assistant:624]: Event Type: 7
[14:18:25.900][D][voice_assistant:721]: Response: "Fertig"
[14:18:25.901][D][voice_assistant:624]: Event Type: 8
[14:18:25.901][D][voice_assistant:743]: Response URL: "http://192.168.178.44:8123/api/tts_proxy/bV5mh07dNhVy98LyIS8wgw.flac"
[14:18:25.902][D][voice_assistant:478]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[14:18:25.915][D][voice_assistant:485]: Desired state set to STREAMING_RESPONSE
[14:18:25.919][D][voice_assistant:624]: Event Type: 2
[14:18:25.920][D][voice_assistant:766]: Assist Pipeline ended
[14:18:25.930][D][media_player:084]: 'Marvin Media Player' - Setting
[14:18:25.942][D][media_player:091]: Media URL: http://192.168.178.44:8123/api/tts_proxy/bV5mh07dNhVy98LyIS8wgw.flac
[14:18:25.943][D][media_player:097]: Announcement: yes
[14:18:25.950][D][main:583]: =>> on_end: Voice assistant has finished all tasks.
[14:18:25.964][D][speaker_media_player:406]: State changed to ANNOUNCING
[14:18:25.983][D][speaker_media_player.pipeline:114]: Reading FLAC file type
[14:18:25.983][D][ring_buffer:034][ann_read]: Created ring buffer with size 500000
[14:18:26.073][D][speaker_media_player.pipeline:124]: Decoded audio has 1 channels, 16000 Hz sample rate, and 16 bits per sample
[14:18:26.085][D][i2s_audio.speaker:102]: Starting
[14:18:26.088][D][i2s_audio.speaker:106]: Started
[14:18:26.088][D][ring_buffer:034][speaker_task]: Created ring buffer with size 16000
INFO Processing unexpected disconnect from ESPHome API for marvin @ 192.168.178.76
WARNING Disconnected from API
If I enter the URLs into a browser, the PC plays the sound…
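Because the log ends with an API disconnect right after the speaker starts, it may help to find out why the device restarted. A minimal sketch using ESPHome's debug component is shown here; the entity names are made up for illustration, and if the config already has `text_sensor:` or `sensor:` sections, the entries would need to be merged into those instead of declaring the keys twice.

```yaml
# Report the last reset reason and free heap, to tell a power-related
# reset apart from a software crash. Entity names are hypothetical.
debug:
  update_interval: 5s

text_sensor:
  - platform: debug
    reset_reason:
      name: "Reset Reason"   # shows why the chip last restarted

sensor:
  - platform: debug
    free:
      name: "Heap Free"      # free heap in bytes, to spot memory exhaustion
```

A brownout-type reset reason after a playback attempt would point at the power supply or the amplifier drawing too much current, while a panic or watchdog reason would point at the firmware side.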
Also with a MAX98357A, but from a different supplier. I had ordered two more for another project. One board is blue with a blue connector and the other is purple with a green connector, but actually both are MAX98357As with identical specifications. The purple one works well.
The MAX98357A board was the last possible source of the problem. If switching it hadn’t helped, I would have given up. I had already switched to the GPIOs you suggested, but it still wouldn’t work properly; sometimes half a sentence would come out, but mostly nothing. Now everything is working reliably (so far).