HA Voice Preview Edition slow to play audio

My HA Voice Preview Edition is slow to play audio.
I initially thought this was just a slow response from the assistant or the STT/TTS processing, but looking at the timings, these are relatively fast. It processes everything within 3 seconds, but takes a good 20+ seconds to actually play the response.

Voice Processing:

I noticed when looking at the actual device in HA that it sits in the “Responding” phase for a long time (that 20s gap) before moving to “Playing”.

The kicker is that this slow behaviour can also be seen when trying to play local media directly on the device. It’s not quite as long, but still there…

It almost feels like the device needs to create a buffer for the audio rather than streaming it live. Has anyone else experienced this?

I have only just started playing with the PE, but I have noticed that when sending our (reasonably long) bedtime message to it, there is a delay before it starts to play, longer than on the outgoing Nest Hub. It then appears to play only a few seconds, followed by sections of silence where it misses bits out, before eventually stopping completely without having played the full message.

Its symptoms are almost like it tries to cache the full message to the device before playing it and can’t cope, but I assume it’s supposed to be, or actually is, streaming it.

That is my feeling too. A little bit annoying as it kinda breaks a longer conversational response if you’re waiting 10+ seconds for it to buffer/cache before responding.

I’d like to see if anyone else is getting similar symptoms.

Ok, I’ve played around some more and set up a second device and that doesn’t have the same issue.

It could potentially be a hardware issue, but it definitely needs looking into. Hopefully an update can resolve it, or this hardware may need returning.

If anyone stumbles across this I’d really value some input. I will add the relevant logs of my two devices, where you can see the massive delay in one compared to the other even when asking a local question such as “What is the time?”

Normal Response Speed - Logs
[00:29:40][D][micro_wake_word:417]: State changed from IDLE to DETECTING_WAKE_WORD
[00:29:50][D][power_supply:048]: Disabling power supply.
[00:29:59][D][micro_wake_word:355]: Detected 'Okay Nabu' with sliding average probability is 0.98 and max probability is 1.00
[00:29:59][D][media_player:080]: 'Media Player' - Setting
[00:29:59][D][media_player:084]:   Command: STOP
[00:29:59][D][media_player:093]:  Announcement: yes
[00:29:59][D][media_player:080]: 'Media Player' - Setting
[00:29:59][D][media_player:093]:  Announcement: yes
[00:29:59][D][ring_buffer:034]: Created ring buffer with size 48000
[00:29:59][D][ring_buffer:034]: Created ring buffer with size 48000
[00:29:59][D][ring_buffer:034]: Created ring buffer with size 65536
[00:29:59][D][ring_buffer:034]: Created ring buffer with size 65536
[00:29:59][D][nabu_media_player.pipeline:173]: Reading FLAC file type
[00:29:59][D][nabu_media_player.pipeline:184]: Decoded audio has 1 channels, 48000 Hz sample rate, and 16 bits per sample
[00:29:59][D][nabu_media_player.pipeline:211]: Converting mono channel audio to stereo channel audio
[00:29:59][D][ring_buffer:034][speaker_task]: Created ring buffer with size 19200
[00:29:59][D][i2s_audio.speaker:111]: Starting Speaker
[00:29:59][D][i2s_audio.speaker:116]: Started Speaker
[00:29:59][D][voice_assistant:515]: State changed from IDLE to START_MICROPHONE
[00:29:59][D][voice_assistant:522]: Desired state set to START_PIPELINE
[00:29:59][D][voice_assistant:225]: Starting Microphone
[00:29:59][D][ring_buffer:034]: Created ring buffer with size 16384
[00:29:59][D][voice_assistant:515]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[00:29:59][D][voice_assistant:515]: State changed from STARTING_MICROPHONE to START_PIPELINE
[00:29:59][D][voice_assistant:280]: Requesting start...
[00:29:59][D][voice_assistant:515]: State changed from START_PIPELINE to STARTING_PIPELINE
[00:29:59][D][voice_assistant:537]: Client started, streaming microphone
[00:29:59][D][voice_assistant:515]: State changed from STARTING_PIPELINE to STREAMING_MICROPHONE
[00:29:59][D][voice_assistant:522]: Desired state set to STREAMING_MICROPHONE
[00:29:59][D][voice_assistant:641]: Event Type: 1
[00:29:59][D][voice_assistant:644]: Assist Pipeline running
[00:29:59][D][voice_assistant:641]: Event Type: 3
[00:29:59][D][voice_assistant:655]: STT started
[00:29:59][D][light:036]: 'voice_assistant_leds' Setting:
[00:29:59][D][light:047]:   State: ON
[00:29:59][D][light:051]:   Brightness: 66%
[00:29:59][D][light:109]:   Effect: 'Waiting for Command'
[00:29:59][D][power_supply:033]: Enabling power supply.
[00:30:00][D][esp32.preferences:114]: Saving 4 preferences to flash...
[00:30:00][D][esp32.preferences:142]: Saving 4 preferences to flash: 3 cached, 1 written, 0 failed
[00:30:01][D][voice_assistant:641]: Event Type: 11
[00:30:01][D][voice_assistant:804]: Starting STT by VAD
[00:30:01][D][light:036]: 'voice_assistant_leds' Setting:
[00:30:01][D][light:051]:   Brightness: 66%
[00:30:01][D][light:109]:   Effect: 'Listening For Command'
[00:30:03][D][voice_assistant:641]: Event Type: 12
[00:30:03][D][voice_assistant:808]: STT by VAD end
[00:30:03][D][voice_assistant:515]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[00:30:03][D][voice_assistant:522]: Desired state set to AWAITING_RESPONSE
[00:30:03][D][voice_assistant:515]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[00:30:03][D][light:036]: 'voice_assistant_leds' Setting:
[00:30:03][D][light:051]:   Brightness: 66%
[00:30:03][D][light:109]:   Effect: 'Thinking'
[00:30:03][D][voice_assistant:515]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[00:30:03][D][voice_assistant:515]: State changed from AWAITING_RESPONSE to AWAITING_RESPONSE
[00:30:03][D][power_supply:033]: Enabling power supply.
[00:30:04][D][power_supply:033]: Enabling power supply.
[00:30:04][D][power_supply:033]: Enabling power supply.
[00:30:04][D][power_supply:033]: Enabling power supply.
[00:30:05][D][power_supply:033]: Enabling power supply.
[00:30:05][D][power_supply:033]: Enabling power supply.
[00:30:06][D][power_supply:033]: Enabling power supply.
[00:30:06][D][power_supply:033]: Enabling power supply.
[00:30:06][D][power_supply:033]: Enabling power supply.
[00:30:06][D][voice_assistant:641]: Event Type: 4
[00:30:06][D][voice_assistant:669]: Speech recognised as: " What's the time?"
[00:30:06][D][voice_assistant:641]: Event Type: 5
[00:30:06][D][voice_assistant:674]: Intent started
[00:30:06][D][voice_assistant:641]: Event Type: 6
[00:30:06][D][voice_assistant:641]: Event Type: 7
[00:30:06][D][voice_assistant:697]: Response: "0:30 AM"
[00:30:06][D][light:036]: 'voice_assistant_leds' Setting:
[00:30:06][D][light:051]:   Brightness: 66%
[00:30:06][D][light:109]:   Effect: 'Replying'
[00:30:06][D][voice_assistant:641]: Event Type: 8
[00:30:06][D][voice_assistant:719]: Response URL: "http://192.168.100.5:9123/api/tts_proxy/0iJDBoepgauiQul-S_SvFQ.flac"
[00:30:06][D][voice_assistant:515]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[00:30:06][D][voice_assistant:522]: Desired state set to STREAMING_RESPONSE
[00:30:06][D][media_player:080]: 'Media Player' - Setting
[00:30:06][D][media_player:087]:   Media URL: http://192.168.100.5:9123/api/tts_proxy/0iJDBoepgauiQul-S_SvFQ.flac
[00:30:06][D][media_player:093]:  Announcement: yes
[00:30:06][D][voice_assistant:641]: Event Type: 2
[00:30:06][D][voice_assistant:733]: Assist Pipeline ended
[00:30:07][D][nabu_media_player.pipeline:173]: Reading FLAC file type
[00:30:07][D][nabu_media_player.pipeline:184]: Decoded audio has 1 channels, 48000 Hz sample rate, and 16 bits per sample
[00:30:07][D][nabu_media_player.pipeline:211]: Converting mono channel audio to stereo channel audio
[00:30:08][D][voice_assistant:515]: State changed from STREAMING_RESPONSE to IDLE
[00:30:08][D][voice_assistant:522]: Desired state set to IDLE
[00:30:08][D][light:036]: 'voice_assistant_leds' Setting:
[00:30:08][D][light:047]:   State: OFF
[00:30:08][D][light:109]:   Effect: 'None'
[00:30:16][I][safe_mode:041]: Boot seems successful; resetting boot loop counter
[00:30:16][D][esp32.preferences:114]: Saving 1 preferences to flash...
[00:30:17][D][esp32.preferences:142]: Saving 1 preferences to flash: 0 cached, 1 written, 0 failed
[00:30:18][D][power_supply:048]: Disabling power supply.

Slow Response Speed - Logs
[00:32:14][D][micro_wake_word:417]: State changed from IDLE to DETECTING_WAKE_WORD
[00:32:19][D][micro_wake_word:355]: Detected 'Okay Nabu' with sliding average probability is 0.98 and max probability is 1.00
[00:32:19][D][media_player:080]: 'Media Player' - Setting
[00:32:19][D][media_player:084]:   Command: STOP
[00:32:19][D][media_player:093]:  Announcement: yes
[00:32:19][D][media_player:080]: 'Media Player' - Setting
[00:32:19][D][media_player:093]:  Announcement: yes
[00:32:19][D][ring_buffer:034]: Created ring buffer with size 48000
[00:32:19][D][ring_buffer:034]: Created ring buffer with size 48000
[00:32:19][D][ring_buffer:034]: Created ring buffer with size 65536
[00:32:19][D][ring_buffer:034]: Created ring buffer with size 65536
[00:32:19][D][nabu_media_player.pipeline:173]: Reading FLAC file type
[00:32:19][D][nabu_media_player.pipeline:184]: Decoded audio has 1 channels, 48000 Hz sample rate, and 16 bits per sample
[00:32:19][D][nabu_media_player.pipeline:211]: Converting mono channel audio to stereo channel audio
[00:32:19][D][ring_buffer:034][speaker_task]: Created ring buffer with size 19200
[00:32:19][D][i2s_audio.speaker:111]: Starting Speaker
[00:32:19][D][i2s_audio.speaker:116]: Started Speaker
[00:32:19][D][voice_assistant:515]: State changed from IDLE to START_MICROPHONE
[00:32:19][D][voice_assistant:522]: Desired state set to START_PIPELINE
[00:32:19][D][voice_assistant:225]: Starting Microphone
[00:32:19][D][ring_buffer:034]: Created ring buffer with size 16384
[00:32:19][D][voice_assistant:515]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[00:32:19][D][voice_assistant:515]: State changed from STARTING_MICROPHONE to START_PIPELINE
[00:32:19][D][voice_assistant:280]: Requesting start...
[00:32:19][D][voice_assistant:515]: State changed from START_PIPELINE to STARTING_PIPELINE
[00:32:19][D][voice_assistant:537]: Client started, streaming microphone
[00:32:19][D][voice_assistant:515]: State changed from STARTING_PIPELINE to STREAMING_MICROPHONE
[00:32:19][D][voice_assistant:522]: Desired state set to STREAMING_MICROPHONE
[00:32:19][D][voice_assistant:641]: Event Type: 1
[00:32:19][D][voice_assistant:644]: Assist Pipeline running
[00:32:19][D][voice_assistant:641]: Event Type: 3
[00:32:19][D][voice_assistant:655]: STT started
[00:32:19][D][light:036]: 'voice_assistant_leds' Setting:
[00:32:19][D][light:047]:   State: ON
[00:32:19][D][light:051]:   Brightness: 66%
[00:32:19][D][light:109]:   Effect: 'Waiting for Command'
[00:32:19][D][power_supply:033]: Enabling power supply.
[00:32:20][D][voice_assistant:641]: Event Type: 11
[00:32:20][D][voice_assistant:804]: Starting STT by VAD
[00:32:20][D][light:036]: 'voice_assistant_leds' Setting:
[00:32:20][D][light:051]:   Brightness: 66%
[00:32:20][D][light:109]:   Effect: 'Listening For Command'
[00:32:22][D][voice_assistant:641]: Event Type: 12
[00:32:22][D][voice_assistant:808]: STT by VAD end
[00:32:22][D][voice_assistant:515]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[00:32:22][D][voice_assistant:522]: Desired state set to AWAITING_RESPONSE
[00:32:22][D][voice_assistant:515]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[00:32:22][D][light:036]: 'voice_assistant_leds' Setting:
[00:32:22][D][light:051]:   Brightness: 66%
[00:32:22][D][light:109]:   Effect: 'Thinking'
[00:32:22][D][voice_assistant:515]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[00:32:22][D][voice_assistant:515]: State changed from AWAITING_RESPONSE to AWAITING_RESPONSE
[00:32:22][D][power_supply:033]: Enabling power supply.
[00:32:23][D][power_supply:033]: Enabling power supply.
[00:32:23][D][power_supply:033]: Enabling power supply.
[00:32:23][D][power_supply:033]: Enabling power supply.
[00:32:24][D][power_supply:033]: Enabling power supply.
[00:32:24][D][power_supply:033]: Enabling power supply.
[00:32:24][D][power_supply:033]: Enabling power supply.
[00:32:25][D][power_supply:033]: Enabling power supply.
[00:32:25][D][power_supply:033]: Enabling power supply.
[00:32:25][D][power_supply:033]: Enabling power supply.
[00:32:26][D][power_supply:033]: Enabling power supply.
[00:32:26][D][power_supply:033]: Enabling power supply.
[00:32:26][D][power_supply:033]: Enabling power supply.
[00:32:26][D][esp32.preferences:114]: Saving 4 preferences to flash...
[00:32:26][D][esp32.preferences:142]: Saving 4 preferences to flash: 3 cached, 1 written, 0 failed
[00:32:27][D][power_supply:033]: Enabling power supply.
[00:32:27][D][power_supply:033]: Enabling power supply.
[00:32:27][D][power_supply:033]: Enabling power supply.
[00:32:28][D][power_supply:033]: Enabling power supply.
[00:32:28][D][power_supply:033]: Enabling power supply.
[00:32:28][D][power_supply:033]: Enabling power supply.
[00:32:29][D][power_supply:033]: Enabling power supply.
[00:32:29][D][power_supply:033]: Enabling power supply.
[00:32:29][D][power_supply:033]: Enabling power supply.
[00:32:30][D][power_supply:033]: Enabling power supply.
[00:32:30][D][power_supply:033]: Enabling power supply.
[00:32:30][D][power_supply:033]: Enabling power supply.
[00:32:31][D][power_supply:033]: Enabling power supply.
[00:32:31][D][power_supply:033]: Enabling power supply.
[00:32:31][D][power_supply:033]: Enabling power supply.
[00:32:32][D][power_supply:033]: Enabling power supply.
[00:32:32][D][power_supply:033]: Enabling power supply.
[00:32:32][D][power_supply:033]: Enabling power supply.
[00:32:33][D][power_supply:033]: Enabling power supply.
[00:32:33][D][power_supply:033]: Enabling power supply.
[00:32:33][D][power_supply:033]: Enabling power supply.
[00:32:34][D][power_supply:033]: Enabling power supply.
[00:32:34][D][power_supply:033]: Enabling power supply.
[00:32:34][D][power_supply:033]: Enabling power supply.
[00:32:35][D][power_supply:033]: Enabling power supply.
[00:32:35][D][power_supply:033]: Enabling power supply.
[00:32:35][D][power_supply:033]: Enabling power supply.
[00:32:36][D][power_supply:033]: Enabling power supply.
[00:32:36][D][power_supply:033]: Enabling power supply.
[00:32:36][D][power_supply:033]: Enabling power supply.
[00:32:37][D][power_supply:033]: Enabling power supply.
[00:32:37][D][power_supply:033]: Enabling power supply.
[00:32:37][D][power_supply:033]: Enabling power supply.
[00:32:38][D][power_supply:033]: Enabling power supply.
[00:32:38][D][power_supply:033]: Enabling power supply.
[00:32:38][D][power_supply:033]: Enabling power supply.
[00:32:39][D][power_supply:033]: Enabling power supply.
[00:32:39][D][power_supply:033]: Enabling power supply.
[00:32:39][D][power_supply:033]: Enabling power supply.
[00:32:40][D][power_supply:033]: Enabling power supply.
[00:32:40][D][power_supply:033]: Enabling power supply.
[00:32:40][D][power_supply:033]: Enabling power supply.
[00:32:41][D][power_supply:033]: Enabling power supply.
[00:32:41][D][power_supply:033]: Enabling power supply.
[00:32:41][D][power_supply:033]: Enabling power supply.
[00:32:42][D][power_supply:033]: Enabling power supply.
[00:32:42][D][power_supply:033]: Enabling power supply.
[00:32:42][D][power_supply:033]: Enabling power supply.
[00:32:43][D][power_supply:033]: Enabling power supply.
[00:32:43][D][power_supply:033]: Enabling power supply.
[00:32:43][D][power_supply:033]: Enabling power supply.
[00:32:44][D][power_supply:033]: Enabling power supply.
[00:32:44][D][power_supply:033]: Enabling power supply.
[00:32:44][D][power_supply:033]: Enabling power supply.
[00:32:45][D][power_supply:033]: Enabling power supply.
[00:32:45][D][power_supply:033]: Enabling power supply.
[00:32:46][D][power_supply:033]: Enabling power supply.
[00:32:46][D][power_supply:033]: Enabling power supply.
[00:32:46][D][power_supply:033]: Enabling power supply.
[00:32:47][D][power_supply:033]: Enabling power supply.
[00:32:47][D][power_supply:033]: Enabling power supply.
[00:32:47][D][power_supply:033]: Enabling power supply.
[00:32:47][I][safe_mode:041]: Boot seems successful; resetting boot loop counter
[00:32:47][D][esp32.preferences:114]: Saving 1 preferences to flash...
[00:32:47][D][esp32.preferences:142]: Saving 1 preferences to flash: 0 cached, 1 written, 0 failed
[00:32:48][D][power_supply:033]: Enabling power supply.
[00:32:48][D][power_supply:033]: Enabling power supply.
[00:32:48][D][power_supply:033]: Enabling power supply.
[00:32:49][D][power_supply:033]: Enabling power supply.
[00:32:49][D][power_supply:033]: Enabling power supply.
[00:32:49][D][power_supply:033]: Enabling power supply.
[00:32:50][D][power_supply:033]: Enabling power supply.
[00:32:50][D][power_supply:033]: Enabling power supply.
[00:32:50][D][power_supply:033]: Enabling power supply.
[00:32:50][D][voice_assistant:641]: Event Type: 4
[00:32:50][D][voice_assistant:669]: Speech recognised as: " What's the time?"
[00:32:50][D][voice_assistant:641]: Event Type: 5
[00:32:50][D][voice_assistant:674]: Intent started
[00:32:50][D][voice_assistant:641]: Event Type: 6
[00:32:50][D][voice_assistant:641]: Event Type: 7
[00:32:51][D][voice_assistant:697]: Response: "0:32 AM"
[00:32:51][D][light:036]: 'voice_assistant_leds' Setting:
[00:32:51][D][light:051]:   Brightness: 66%
[00:32:51][D][light:109]:   Effect: 'Replying'
[00:32:51][D][voice_assistant:641]: Event Type: 8
[00:32:51][D][voice_assistant:719]: Response URL: "http://192.168.100.5:9123/api/tts_proxy/HqM16bKI3X0OZuMWvdTBXQ.flac"
[00:32:51][D][voice_assistant:515]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[00:32:51][D][voice_assistant:522]: Desired state set to STREAMING_RESPONSE
[00:32:51][D][media_player:080]: 'Media Player' - Setting
[00:32:51][D][media_player:087]:   Media URL: http://192.168.100.5:9123/api/tts_proxy/HqM16bKI3X0OZuMWvdTBXQ.flac
[00:32:51][D][media_player:093]:  Announcement: yes
[00:32:51][D][voice_assistant:641]: Event Type: 2
[00:32:51][D][voice_assistant:733]: Assist Pipeline ended
[00:32:51][D][nabu_media_player.pipeline:173]: Reading FLAC file type
[00:32:51][D][nabu_media_player.pipeline:184]: Decoded audio has 1 channels, 48000 Hz sample rate, and 16 bits per sample
[00:32:51][D][nabu_media_player.pipeline:211]: Converting mono channel audio to stereo channel audio
[00:32:53][D][voice_assistant:515]: State changed from STREAMING_RESPONSE to IDLE
[00:32:53][D][voice_assistant:522]: Desired state set to IDLE
[00:32:53][D][light:036]: 'voice_assistant_leds' Setting:
[00:32:53][D][light:047]:   State: OFF
[00:32:53][D][light:109]:   Effect: 'None'
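
To make the comparison between the two logs easier, here is a minimal Python sketch that pulls the timestamps of a few milestone messages out of a log and prints the gap between each pair. The milestone names (`wake_word`, `stt_vad_end`, and so on) and the helper functions are my own labels for illustration, not anything from ESPHome, and the sample text is just a handful of lines copied from the slow log above. It assumes all lines are from the same day (no midnight rollover).

```python
import re
from datetime import datetime

# Matches lines such as "[00:32:22][D][voice_assistant:808]: STT by VAD end",
# including the optional task tag, e.g. "[ring_buffer:034][speaker_task]: ...".
LINE_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\]\[\w\]\[([\w.]+):\d+\](?:\[\w+\])?: (.*)$"
)

# Hypothetical milestone labels mapped to substrings seen in the logs above.
MILESTONES = [
    ("wake_word", "Detected '"),
    ("stt_vad_end", "STT by VAD end"),
    ("stt_result", "Speech recognised as"),
    ("tts_url", "Response URL"),
    ("idle", "State changed from STREAMING_RESPONSE to IDLE"),
]


def milestone_times(log_text):
    """Return the first timestamp at which each milestone message appears."""
    times = {}
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue
        stamp, _component, message = match.groups()
        for name, needle in MILESTONES:
            if needle in message and name not in times:
                times[name] = datetime.strptime(stamp, "%H:%M:%S")
    return times


def gaps(log_text):
    """Return whole seconds elapsed between consecutive milestones found."""
    times = milestone_times(log_text)
    names = [name for name, _ in MILESTONES if name in times]
    return {
        f"{a}->{b}": int((times[b] - times[a]).total_seconds())
        for a, b in zip(names, names[1:])
    }


# A few lines copied from the slow log above.
SLOW = """\
[00:32:19][D][micro_wake_word:355]: Detected 'Okay Nabu' with sliding average probability is 0.98 and max probability is 1.00
[00:32:22][D][voice_assistant:808]: STT by VAD end
[00:32:50][D][voice_assistant:669]: Speech recognised as: " What's the time?"
[00:32:51][D][voice_assistant:719]: Response URL: "http://192.168.100.5:9123/api/tts_proxy/HqM16bKI3X0OZuMWvdTBXQ.flac"
[00:32:53][D][voice_assistant:515]: State changed from STREAMING_RESPONSE to IDLE
"""

if __name__ == "__main__":
    for step, seconds in gaps(SLOW).items():
        print(f"{step}: {seconds}s")
```

Run against the two logs above, the gap between "STT by VAD end" and "Speech recognised as" is 3 seconds on the normal device and 28 seconds on the slow one, while the time from the response URL arriving to the device returning to idle is 2 seconds in both.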

Hi,

I just received my unit and seem to be having the same issue. The STT and processing are very quick, and the TTS in my pipeline is also very quick, but the actual response sounding from the speaker has a delay of roughly 20 to 50 seconds. Tomorrow I have some time to take a look at the logs.

@Aitch I think you should create an issue ticket over here: Issues · esphome/home-assistant-voice-pe · GitHub


All streaming clients create a buffer.
There is no live playing as such.
The size might be changed though.
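
For a rough sense of scale, the sketch below converts the ring buffer sizes reported in the logs into seconds of audio. This assumes the buffers hold raw PCM in the format the logs report (48000 Hz, 16 bits, mono before conversion and stereo after); the logs do not actually say what each buffer holds, so treat the pairing of sizes and formats as a guess. `buffer_seconds` is a helper name made up for this example.

```python
def buffer_seconds(size_bytes, sample_rate_hz, bits_per_sample, channels):
    """Seconds of raw PCM audio that fit in a buffer of the given size."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return size_bytes / bytes_per_second


# Buffer sizes taken from the "Created ring buffer with size ..." log lines.
print(round(buffer_seconds(65536, 48000, 16, 1), 2))  # 0.68 (mono PCM assumed)
print(round(buffer_seconds(48000, 48000, 16, 1), 2))  # 0.5  (mono PCM assumed)
print(round(buffer_seconds(19200, 48000, 16, 2), 2))  # 0.1  (stereo PCM assumed)
```

Under that assumption each buffer holds well under a second of audio, so filling them would not by itself account for a wait of 20 seconds or more.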

Thank you for the suggestion - I have created an issue ticket here: Some devices slow to play audio · Issue #257 · esphome/home-assistant-voice-pe · GitHub

I’ve found that if you keep the PE awake by scrolling the volume wheel, the TTS plays instantly.
It appears to be a bug where, if the PE isn’t in an active state, it falls back to polling with a ~20 second delay before it pulls the TTS.