Many problems with the voice assistant

Hello,

(voice assistant, micro-wake-word, speech-to-phrase)
I tried to build a voice assistant based on “Voice assistant based on Voice Assistant PE” (https://github.com/KristopherMackowiak/ha_voice_assistant/), using one speaker instead of two, but so far with only moderate success. Speech recognition works, but the whole system crashes very frequently and no sound comes from the speaker.
I have tried many variations for speaker:, media_player: and micro_wake_word: — sometimes bits_per_sample: 32bit, sometimes 16bit, sometimes sample_rate: 16000, sometimes 48000. Sometimes you can hear the start sound, but currently mostly not. Occasionally you hear a response, but only once, and then silence until the next crash. Playing sounds or radio from the media player does not work at all. I suspect the problems somehow occur when playing sound, because everything works up until the first sound output. What am I doing wrong?

.
.
.
i2s_audio:
  - id: sz_i2s_output
    # i2s_output data pin is GPIO10
    i2s_lrclk_pin:
      number: GPIO7   # LRC on MAX98357A
    i2s_bclk_pin:
      number: GPIO8   # BCLK on MAX98357A

  - id: sz_i2s_input
    # data line is GPIO15
    i2s_lrclk_pin:
      number: GPIO14   # WS on INMP441 microphone
    i2s_bclk_pin:
      number: GPIO13  # SCK on INMP441 microphone

microphone:
  - platform: i2s_audio
    id: sz_i2s_mics
    i2s_din_pin: GPIO15
    adc_type: external
    i2s_audio_id: sz_i2s_input
    channel: left #L/R Pin INMP441 => GND

speaker:
  # Hardware speaker output
  - platform: i2s_audio
    id: sz_i2s_audio_speaker
    sample_rate: 48000 #16000 #48000
    i2s_mode: primary #Voice-PE secondary
    i2s_dout_pin: GPIO10
    bits_per_sample: 32bit #16bit # 32bit
    i2s_audio_id: sz_i2s_output
    dac_type: external
    channel: mono   # stereo
    timeout: never
    buffer_duration: 100ms

  # Virtual speakers to combine the announcement and media streams together into one output
  - platform: mixer
    id: sz_mixing_speaker
    output_speaker: sz_i2s_audio_speaker
    num_channels: 1 #2
    source_speakers:
      - id: sz_announcement_mixing_input
        timeout: never
      - id: sz_media_mixing_input
        timeout: never

  # Virtual speakers to resample each pipeline's audio, if necessary, as the mixer speaker requires the same sample rate
  - platform: resampler
    id: sz_announcement_resampling_speaker
    output_speaker: sz_announcement_mixing_input
    sample_rate: 48000 #16000 #48000
    bits_per_sample: 16
  - platform: resampler
    id: sz_media_resampling_speaker
    output_speaker: sz_media_mixing_input
    sample_rate: 48000 #16000 #8000
    bits_per_sample: 16

media_player:
  - platform: speaker
    id: external_media_player
    name: Schlafzimmer Media Player
    internal: False
    volume_increment: 0.05
    volume_min: 0.4
    volume_max: 1.0 #0.85
#    buffer_size: 40000 #Must be between 4000 and 4000000. Defaults to 100000
    codec_support_enabled: true # set to false and specify a format to save resources
    announcement_pipeline:
      speaker: sz_announcement_resampling_speaker
#      format: FLAC     # FLAC is the least processor intensive codec
      num_channels: 1  # Stereo audio is unnecessary for announcements
      sample_rate: 48000 #16000 #48000
    media_pipeline:
      speaker: sz_media_resampling_speaker
#      format: FLAC     # FLAC is the least processor intensive codec
      num_channels: 1  #2
      sample_rate: 48000 #16000 #48000
    on_mute:
      - script.execute: control_leds
    on_unmute:
      - script.execute: control_leds
    on_volume:
      - script.execute: control_leds
    on_announcement:
.
.
.

micro_wake_word:
  id: schlafzimmer_mww
  microphone:
    microphone: sz_i2s_mics
  stop_after_detection: false
  models:
    - model: Schlafzimmer_Resources/MWW/okay_nabu.json
      id: okay_nabu
.
.
.

What hardware are you using?

See here for help.

The hardware is as described on the page "GitHub - KristopherMackowiak/ha_voice_assistant: Home Assistant DYI voice assistant":

-ESP32-S3-N16R8
-MAX98357A
-INMP441
-LED ring with 12 LEDs

…I think I’ll need a while for this help link with its very long content…

There is a working config in the first post.

Or here

Are you using a good power supply?
I had a similar experience when powering Respeaker Lite from a weak power bank.

I’m currently using a lab power supply. I’ve already increased it to a maximum of 1.2A. That should be enough for a voice assistant with a 2W speaker and 12 LEDs, right?

Assuming you are using WS2812 LEDs, they could draw 720 mA at 100% brightness on their own.

6 W seems sufficient, but try any 2 A phone charger to rule out a power problem for sure.
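
The current figures in the last few posts can be put together as a rough worst-case budget. The following is only a sketch in Python: the 60 mA per LED figure is the full-brightness white value implied by the 720 mA quoted above, the speaker current simply divides its 2 W rating by an assumed 5 V supply (amplifier efficiency is ignored), and `ESP32_PEAK_MA` is a placeholder assumption, not a measured or datasheet value. Typical draw will be well below this upper bound.

```python
# Rough worst-case current budget for the hardware listed in this thread.
LED_COUNT = 12
LED_MA_EACH = 60       # WS2812 at 100% white: 12 x 60 mA = 720 mA, as quoted above
SPEAKER_MW = 2000      # 2 W speaker
SUPPLY_V = 5           # ASSUMPTION: everything is fed from a 5 V rail
ESP32_PEAK_MA = 500    # ASSUMPTION: placeholder for ESP32-S3 peaks, not a measured value


def worst_case_budget_ma():
    """Return the individual worst-case currents and their sum, in mA."""
    leds_ma = LED_COUNT * LED_MA_EACH
    speaker_ma = SPEAKER_MW // SUPPLY_V  # I = P / U, amplifier losses ignored
    total_ma = leds_ma + speaker_ma + ESP32_PEAK_MA
    return {"leds_ma": leds_ma, "speaker_ma": speaker_ma, "total_ma": total_ma}


def headroom_ma(supply_limit_ma):
    """Positive: the supply has margin. Negative: the worst case exceeds the limit."""
    return supply_limit_ma - worst_case_budget_ma()["total_ma"]


print(worst_case_budget_ma())
print("Headroom at a 1.2 A limit:", headroom_ma(1200), "mA")
print("Headroom with a 2 A charger:", headroom_ma(2000), "mA")
```

Under these assumptions the 1.2 A limit has no margin if the LEDs, speaker and Wi-Fi all peak at once, while a 2 A charger does, which is why swapping the supply is a cheap way to rule power out.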

When I switch on the power, a short sound comes from the speaker, followed by silence. Nothing can be played via the media player, and no voice assistant responses are played…

What settings need to be changed for sound to play through the speaker? Sometimes I get one or two sounds from the Assistant, but mostly not. Occasionally I can play an MP3 from the media player a few times, but usually not. I can’t find any pattern or connection between the two.

[14:18:19.928][I][app:190]: ESPHome version 2025.11.2 compiled on Nov 30 2025, 14:10:36
[14:18:19.928][C][logger:261]: Logger:
[14:18:19.928][C][logger:261]:   Max Level: VERBOSE
[14:18:19.928][C][logger:261]:   Initial Level: VERBOSE
[14:18:19.928][C][logger:267]:   Log Baud Rate: 115200
[14:18:19.928][C][logger:267]:   Hardware UART: UART0
[14:18:19.929][C][logger:274]:   Task Log Buffer Size: 768
[14:18:19.946][C][logger:280]:   Level for 'api': VERBOSE
[14:18:19.967][C][template.switch:092]: Template Switch 'Use Marvin wake word'
[14:18:19.967][C][template.switch:092]:   Restore Mode: always OFF
[14:18:19.968][C][template.switch:056]:   Optimistic: YES
[14:18:19.985][C][psram:016]: PSRAM:
[14:18:19.988][C][psram:019]:   Available: YES
[14:18:19.988][C][psram:021]:   Size: 8192 KB
[14:18:20.008][C][i2s_audio.microphone:079]: Microphone:
[14:18:20.008][C][i2s_audio.microphone:079]:   Pin: 15
[14:18:20.008][C][i2s_audio.microphone:079]:   PDM: NO
[14:18:20.008][C][i2s_audio.microphone:079]:   DC offset correction: NO
[14:18:20.022][C][status:018]: Status Binary Sensor 'API Connection'
[14:18:20.023][C][status:021]:   Device Class: 'connectivity'
[14:18:20.040][C][i2s_audio.speaker:074]: Speaker:
[14:18:20.040][C][i2s_audio.speaker:074]:   Pin: 7
[14:18:20.040][C][i2s_audio.speaker:074]:   Buffer duration: 500
[14:18:20.040][C][i2s_audio.speaker:080]:   Timeout: 500 ms
[14:18:20.040][C][i2s_audio.speaker:088]:   Communication format: std
[14:18:20.057][C][captive_portal:122]: Captive Portal:
[14:18:20.074][C][wifi:1062]: WiFi:
[14:18:20.074][C][wifi:1062]:   Connected: YES
[14:18:20.074][C][wifi:827]:   Local MAC: 8e:C5:4E:C3:1C:4D
[14:18:20.076][C][wifi:834]:   IP Address: 192.168.178.76
[14:18:20.078][C][wifi:838]:   SSID: 'XXXXX-FritzBox'[redacted]
[14:18:20.078][C][wifi:838]:   BSSID: 3C:81:CB:05:91:AA[redacted]
[14:18:20.078][C][wifi:838]:   Hostname: 'marvin'
[14:18:20.078][C][wifi:838]:   Signal strength: -52 dB ▂▄▆█
[14:18:20.078][C][wifi:838]:   Channel: 6
[14:18:20.078][C][wifi:838]:   Subnet: 255.255.255.0
[14:18:20.078][C][wifi:838]:   Gateway: 192.168.178.1
[14:18:20.078][C][wifi:838]:   DNS1: 192.168.178.1
[14:18:20.078][C][wifi:838]:   DNS2: 0.0.0.0
[14:18:20.099][C][esphome.ota:093]: Over-The-Air updates:
[14:18:20.099][C][esphome.ota:093]:   Address: marvin.local:3232
[14:18:20.099][C][esphome.ota:093]:   Version: 2
[14:18:20.107][C][esphome.ota:100]:   Password configured
[14:18:20.111][C][safe_mode:018]: Safe Mode:
[14:18:20.111][C][safe_mode:018]:   Successful after: 60s
[14:18:20.111][C][safe_mode:018]:   Invoke after: 10 attempts
[14:18:20.111][C][safe_mode:018]:   Duration: 300s
[14:18:20.123][C][web_server.ota:241]: Web Server OTA
[14:18:20.130][C][api:223]: Server:
[14:18:20.130][C][api:223]:   Address: marvin.local:6053
[14:18:20.130][C][api:223]:   Listen backlog: 4
[14:18:20.130][C][api:223]:   Max connections: 8
[14:18:20.130][C][api:230]:   Noise encryption: YES
[14:18:20.139][C][mdns:177]: mDNS:
[14:18:20.139][C][mdns:177]:   Hostname: marvin
[14:18:20.141][V][mdns:182]:   Services:
[14:18:20.143][V][mdns:184]:   - _esphomelib, _tcp, 6053
[14:18:20.152][V][mdns:187]:     TXT: friendly_name = Marvin
[14:18:20.153][V][mdns:187]:     TXT: version = 2025.11.2
[14:18:20.153][V][mdns:187]:     TXT: mac = 81b56ec31e3d
[14:18:20.161][V][mdns:187]:     TXT: platform = ESP32
[14:18:20.161][V][mdns:187]:     TXT: board = esp32-s3-devkitc-1
[14:18:20.171][V][mdns:187]:     TXT: network = wifi
[14:18:20.180][V][mdns:187]:     TXT: api_encryption = Noise_NNpsk0_25519_ChaChaPoly_SHA256
[14:18:22.506][D][voice_assistant:624]: Event Type: 10
[14:18:22.507][D][voice_assistant:641]: Wake word detected
[14:18:22.510][D][main:578]: =>> on_wake_word_detected: Voice assistant has detected the wakeword!!
[14:18:22.510][D][voice_assistant:624]: Event Type: 3
[14:18:22.510][D][voice_assistant:646]: STT started
[14:18:22.523][D][main:568]: =>> on_listening: Voice assistant is listening...
[14:18:22.789][D][voice_assistant:624]: Event Type: 11
[14:18:22.791][D][voice_assistant:827]: Starting STT by VAD
[14:18:25.421][D][voice_assistant:624]: Event Type: 12
[14:18:25.421][D][voice_assistant:831]: STT by VAD end
[14:18:25.421][D][voice_assistant:478]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[14:18:25.421][D][voice_assistant:485]: Desired state set to AWAITING_RESPONSE
[14:18:25.425][D][voice_assistant:478]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[14:18:25.439][D][voice_assistant:478]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[14:18:25.440][V][i2s_audio.microphone:486]: Task finished, freeing resources and uninstalling driver
[14:18:25.876][D][voice_assistant:624]: Event Type: 4
[14:18:25.876][D][voice_assistant:663]: Speech recognised as: "Stehlampe ausschalten"
[14:18:25.878][D][voice_assistant:624]: Event Type: 5
[14:18:25.878][D][voice_assistant:668]: Intent started
[14:18:25.886][D][voice_assistant:624]: Event Type: 6
[14:18:25.892][D][voice_assistant:624]: Event Type: 7
[14:18:25.900][D][voice_assistant:721]: Response: "Fertig"
[14:18:25.901][D][voice_assistant:624]: Event Type: 8
[14:18:25.901][D][voice_assistant:743]: Response URL: "http://192.168.178.44:8123/api/tts_proxy/bV5mh07dNhVy98LyIS8wgw.flac"
[14:18:25.902][D][voice_assistant:478]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[14:18:25.915][D][voice_assistant:485]: Desired state set to STREAMING_RESPONSE
[14:18:25.919][D][voice_assistant:624]: Event Type: 2
[14:18:25.920][D][voice_assistant:766]: Assist Pipeline ended
[14:18:25.930][D][media_player:084]: 'Marvin Media Player' - Setting
[14:18:25.942][D][media_player:091]:   Media URL: http://192.168.178.44:8123/api/tts_proxy/bV5mh07dNhVy98LyIS8wgw.flac
[14:18:25.943][D][media_player:097]:  Announcement: yes
[14:18:25.950][D][main:583]: =>> on_end: Voice assistant has finished all tasks.
[14:18:25.964][D][speaker_media_player:406]: State changed to ANNOUNCING
[14:18:25.983][D][speaker_media_player.pipeline:114]: Reading FLAC file type
[14:18:25.983][D][ring_buffer:034][ann_read]: Created ring buffer with size 500000
[14:18:26.073][D][speaker_media_player.pipeline:124]: Decoded audio has 1 channels, 16000 Hz sample rate, and 16 bits per sample
[14:18:26.085][D][i2s_audio.speaker:102]: Starting
[14:18:26.088][D][i2s_audio.speaker:106]: Started
[14:18:26.088][D][ring_buffer:034][speaker_task]: Created ring buffer with size 16000
INFO Processing unexpected disconnect from ESPHome API for marvin @ 192.168.178.76
WARNING Disconnected from API
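
The log above reports decoded audio with 1 channel, 16000 Hz and 16 bits per sample, a speaker buffer duration of 500, and a speaker ring buffer of size 16000. As a quick sanity check, the sketch below computes how many bytes a given buffer duration needs for a given stream format. It assumes the ring buffer is sized as byte rate times buffer duration; that is an assumption about ESPHome's behaviour that is not confirmed anywhere in this thread, and `buffer_bytes` is a hypothetical helper, not an ESPHome function.

```python
def buffer_bytes(sample_rate_hz, bits_per_sample, channels, duration_ms):
    """Bytes of PCM audio needed to hold duration_ms of the given stream format."""
    bytes_per_frame = (bits_per_sample // 8) * channels
    return sample_rate_hz * bytes_per_frame * duration_ms // 1000


# Format reported in the log: 16 kHz, 16-bit, mono, buffer duration 500
print(buffer_bytes(16000, 16, 1, 500))

# Format from the posted speaker config: 48 kHz, 32-bit, mono, buffer_duration 100ms
print(buffer_bytes(48000, 32, 1, 100))
```

If the assumption holds, the logged ring buffer size of 16000 matches 500 ms of 16 kHz, 16-bit mono audio rather than the 48000 Hz, 32bit and 100ms values in the posted config. Together with the logged timeout of 500 ms (the posted config uses `timeout: never`), this may simply mean the log was captured with a different revision of the config.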

If I enter the URLs into a browser, the PC plays the sound…

I replaced the MAX98357A with a different module from another supplier, and now it works. Sound output is working without any problems.

What did you replace it with?

Also with a MAX98357A, but from a different supplier. I had ordered two more for another project. One board is blue with a blue connector and the other is purple with a green connector, but actually both are MAX98357As with identical specifications. The purple one works well.

The MAX98357A board was the last possible source of the problem. If switching it hadn’t helped, I would have given up. I had already switched to the GPIOs you suggested, but it still wouldn’t work properly; sometimes half a sentence would come out, but mostly nothing. Now everything is working reliably (so far).

That’s interesting. I only have purple ones and have never seen a blue one. Looks like colour really does matter :)