First TTS Playback Fails — Works on Retry — Need Help Debugging Speaker Pipeline

I’m running a fork of Rob Meades’ Home Assistant Voice PE custom ESPHome repo, built specifically to handle local voice commands. It uses the old Atom Echo config to work with the OpenWakeWord pipeline. I’m on ESPHome 2025.2, as some updates seem to have broken the modifications needed to work with OpenWakeWord.

I’ve hit a consistent problem with my fork:

The first TTS playback (response to my voice request) always fails with:

[E][speaker_media_player.pipeline:112]: Media reader encountered an error: ESP_FAIL

If I repeat the same voice command immediately, it works fine. All subsequent commands work until the device goes idle for a while — then the first playback fails again.

I’ve tried:

  • Keeping the amp always ON (removed power_supply toggling, forced the GPIO high with switch: or output: blocks)
  • Adding a retry in voice_assistant.cpp that repeats the buffer write after 250 ms if speaker_->play() returns 0
  • Increasing the speaker-timeout window in the pipeline C++

None of these has fixed the underlying issue. From the pipeline code I know that speaker_media_player starts streaming the .flac TTS file from HA’s api/tts_proxy or api/esphome/ffmpeg_proxy. If the file isn’t fully ready (due to TTS generation lag or disk I/O) when the ESP tries to read it, the pipeline throws ESP_FAIL. The next try works because the file is cached and fully present by then.
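
One way to test the "file not ready yet" theory from outside the ESP is to request the TTS URL from another machine on the same network as soon as it shows up in the log, then request it again a moment later. What follows is only a rough diagnostic sketch in Python. probe_once and probe are hypothetical helper names, the 250 ms default delay simply mirrors the retry delay mentioned above, and it assumes the URL is reachable from the machine running the script. It reports the HTTP status, body size, and elapsed time per attempt, nothing more.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe_once(url, timeout=5.0):
    """Fetch url once and return (status, byte_count, seconds, error_text).

    status is None when no HTTP response was received at all.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
            return resp.status, len(body), time.monotonic() - start, None
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (for example 404).
        return exc.code, 0, time.monotonic() - start, str(exc)
    except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
        # Connection refused, timeout, truncated body, and similar failures.
        return None, 0, time.monotonic() - start, str(exc)


def probe(url, attempts=3, delay=0.25, fetch=probe_once):
    """Call fetch(url) `attempts` times, sleeping `delay` seconds in between.

    Returns a list with one result tuple per attempt, in order.
    """
    results = []
    for i in range(attempts):
        results.append(fetch(url))
        if i + 1 < attempts:
            time.sleep(delay)
    return results
```

For example, probe("http://192.168.50.46:8123/api/tts_proxy/7-LCRzsapRaFgmgR96bxlA.flac") returns one tuple per attempt. If the first attempt shows an error or a much smaller body than the later ones, that would support the theory; if all attempts look the same, the cause is probably elsewhere.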

Here are the lines from the logs where the issue pops up:

First attempt (fails):

[15:20:43][D][voice_assistant:548]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[15:20:43][D][voice_assistant:548]: State changed from AWAITING_RESPONSE to AWAITING_RESPONSE
[15:20:47][D][voice_assistant:674]: Event Type: 4
[15:20:47][D][voice_assistant:702]: Speech recognised as: " What time is it?"
[15:20:47][D][voice_assistant:674]: Event Type: 5
[15:20:47][D][voice_assistant:707]: Intent started
[15:20:47][D][voice_assistant:674]: Event Type: 6
[15:20:47][D][voice_assistant:674]: Event Type: 7
[15:20:47][D][voice_assistant:730]: Response: "It is currently 3:20 PM"
[15:20:47][D][light:036]: 'voice_assistant_leds' Setting:
[15:20:47][D][light:051]:   Brightness: 66%
[15:20:47][D][light:109]:   Effect: 'Replying'
[15:20:47][D][voice_assistant:674]: Event Type: 8
[15:20:47][D][voice_assistant:752]: Response URL: "http://192.168.50.46:8123/api/tts_proxy/7-LCRzsapRaFgmgR96bxlA.flac"
[15:20:47][D][voice_assistant:548]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[15:20:47][D][voice_assistant:555]: Desired state set to STREAMING_RESPONSE
[15:20:47][D][media_player:073]: 'Voice PE Dasha' - Setting
[15:20:47][D][media_player:080]:   Media URL: http://192.168.50.46:8123/api/tts_proxy/7-LCRzsapRaFgmgR96bxlA.flac
[15:20:47][D][media_player:086]:  Announcement: yes
[15:20:47][D][speaker_media_player:420]: State changed to ANNOUNCING
[15:20:47][D][voice_assistant:674]: Event Type: 2
[15:20:47][D][voice_assistant:766]: Assist Pipeline ended
[15:20:47][E][speaker_media_player.pipeline:112]: Media reader encountered an error: ESP_FAIL
[15:20:47][D][speaker_media_player:420]: State changed to IDLE
[15:20:47][D][light:036]: 'voice_assistant_leds' Setting:
[15:20:47][D][light:047]:   State: OFF
[15:20:47][D][light:109]:   Effect: 'None'
[15:20:47][D][voice_assistant:548]: State changed from STREAMING_RESPONSE to IDLE
[15:20:47][D][voice_assistant:555]: Desired state set to IDLE
[15:20:47][D][voice_assistant:548]: State changed from IDLE to START_MICROPHONE
[15:20:47][D][voice_assistant:555]: Desired state set to START_PIPELINE
[15:20:47][D][voice_assistant:225]: Starting Microphone
[15:20:47][D][voice_assistant:548]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[15:20:47][D][voice_assistant:548]: State changed from STARTING_MICROPHONE to START_PIPELINE
[15:20:47][D][voice_assistant:280]: Requesting start...
[15:20:47][D][voice_assistant:548]: State changed from START_PIPELINE to STARTING_PIPELINE
[15:20:47][D][voice_assistant:570]: Client started, streaming microphone
[15:20:47][D][voice_assistant:548]: State changed from STARTING_PIPELINE to STREAMING_MICROPHONE
[15:20:47][D][voice_assistant:555]: Desired state set to STREAMING_MICROPHONE
[15:20:47][D][voice_assistant:674]: Event Type: 1
[15:20:47][D][voice_assistant:677]: Assist Pipeline running
[15:20:47][D][voice_assistant:674]: Event Type: 9

This line is where it all goes wrong:

[15:20:47][E][speaker_media_player.pipeline:112]: Media reader encountered an error: ESP_FAIL

Second attempt (immediate repeat — works):

[D][speaker_media_player.pipeline:124]: Decoded audio has 1 channels, 48000 Hz sample rate, and 16 bits per sample

Any insight is appreciated.