ESP32 S3 BOX 3 requires wake word to be repeated multiple times before it responds

I am new to this so I apologize if this is something stupid. I checked the forum and other places for answers and couldn’t find much help.
I recently bought an ESP32 S3 BOX 3 for local wake word detection. It does not respond to the wake word right away; I have to repeat it a few times before it starts listening. I had the same issue with the M5 Atom Echo.
I checked the logs and found that the wake word is in fact detected the first time it is spoken, but the device then starts waiting for VAD (voice activity detection). Nothing on the box's display indicates this state change. I have to repeat the wake word a few more times for it to be recognized again, and only then does the graphic on the display change to show that it has started listening.
I verified this by disabling openWakeWord in Wyoming protocol: the device would wait for VAD and then go to an error. It went back to its usual behavior (responding after two or more wake words) when I re-enabled openWakeWord. The confusing part is that I have not enabled openWakeWord in the voice assistant pipeline used by the Box 3, so I am not sure why it is doing this.

[01:13:36][D][micro_wake_word:178]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[01:13:44][D][micro_wake_word:363]: Wake word sliding average probability is 0.565 and most recent probability is 0.957
[01:13:44][D][micro_wake_word:129]: Wake Word Detected
[01:13:44][D][micro_wake_word:178]: State changed from DETECTING_WAKE_WORD to STOP_MICROPHONE
[01:13:44][D][micro_wake_word:135]: Stopping Microphone
[01:13:44][D][esp_adf.microphone:234]: Stopping microphone
[01:13:44][D][micro_wake_word:178]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[01:13:44][D][esp-idf:000][filter]: W (606141) AUDIO_ELEMENT: IN-[filter] AEL_IO_ABORT
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: E (606143) AUDIO_ELEMENT: [filter] Element already stopped
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: W (606174) AUDIO_PIPELINE: There are no listener registered
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: I (606176) AUDIO_PIPELINE: audio_pipeline_unlinked
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: W (606178) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINATE
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: I (606180) I2S: DMA queue destroyed
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: W (606180) AUDIO_ELEMENT: [filter] Element has not create when AUDIO_ELEMENT_TERMINATE
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: W (606184) AUDIO_ELEMENT: [raw] Element has not create when AUDIO_ELEMENT_TERMINATE
[01:13:44]
[01:13:44][D][esp_adf.microphone:285]: Microphone stopped
[01:13:44][D][micro_wake_word:178]: State changed from STOPPING_MICROPHONE to IDLE
[01:13:44][D][voice_assistant:504]: State changed from IDLE to START_MICROPHONE
[01:13:44][D][voice_assistant:510]: Desired state set to WAIT_FOR_VAD
[01:13:44][D][voice_assistant:221]: Starting Microphone
[01:13:44][D][voice_assistant:504]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[01:13:44][D][esp-idf:000][read_task]: I (606196) I2S: DMA Malloc info, datalen=blocksize=512, dma_buf_count=8
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: I (606200) I2S: I2S0, MCLK output by GPIO2
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: I (606204) AUDIO_PIPELINE: link el->rb, el:0x3d0593a8, tag:i2s, rb:0x3d0597bc
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: I (606208) AUDIO_PIPELINE: link el->rb, el:0x3d05951c, tag:filter, rb:0x3d05b7fc
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: I (606213) AUDIO_ELEMENT: [i2s-0x3d0593a8] Element task created
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: I (606215) AUDIO_THREAD: The filter task allocate stack on external memory
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: I (606218) AUDIO_ELEMENT: [filter-0x3d05951c] Element task created
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: I (606220) AUDIO_ELEMENT: [raw-0x3d05964c] Element task created
[01:13:44]
[01:13:44]
[01:13:44]
[01:13:44][D][esp-idf:000][i2s]: I (606226) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1
[01:13:44]
[01:13:44][D][esp-idf:000][filter]: I (606228) AUDIO_ELEMENT: [filter] AEL_MSG_CMD_RESUME,state:1
[01:13:44]
[01:13:44][D][esp-idf:000][filter]: I (606231) RSP_FILTER: sample rate of source data : 16000, channel of source data : 2, sample rate of destination data : 16000, channel of destination data : 1
[01:13:44]
[01:13:44][D][esp-idf:000][read_task]: I (606235) AUDIO_PIPELINE: Pipeline started
[01:13:44]
[01:13:44][D][esp_adf.microphone:273]: Microphone started
[01:13:44][D][voice_assistant:504]: State changed from STARTING_MICROPHONE to WAIT_FOR_VAD
[01:13:44][D][voice_assistant:245]: Waiting for speech...
[01:13:44][D][voice_assistant:504]: State changed from WAIT_FOR_VAD to WAITING_FOR_VAD
[01:13:48][D][voice_assistant:258]: VAD detected speech
[01:13:48][D][voice_assistant:504]: State changed from WAITING_FOR_VAD to START_PIPELINE
[01:13:48][D][voice_assistant:510]: Desired state set to STREAMING_MICROPHONE
[01:13:48][D][voice_assistant:275]: Requesting start...
[01:13:48][D][voice_assistant:504]: State changed from START_PIPELINE to STARTING_PIPELINE
[01:13:48][D][voice_assistant:525]: Client started, streaming microphone
[01:13:48][D][voice_assistant:504]: State changed from STARTING_PIPELINE to STREAMING_MICROPHONE
[01:13:48][D][voice_assistant:510]: Desired state set to STREAMING_MICROPHONE
[01:13:48][D][voice_assistant:627]: Event Type: 1
[01:13:48][D][voice_assistant:630]: Assist Pipeline running
[01:13:48][D][voice_assistant:627]: Event Type: 9
[01:13:53][D][voice_assistant:627]: Event Type: 10
[01:13:53][D][voice_assistant:636]: Wake word detected
[01:13:53][D][voice_assistant:627]: Event Type: 3
[01:13:53][D][voice_assistant:641]: STT started
[01:13:53][D][text_sensor:064]: 'text_request': Sending state '...'
[01:13:53][D][text_sensor:064]: 'text_response': Sending state '...'
[01:13:53][W][component:237]: Component voice_assistant took a long time for an operation (223 ms).
[01:13:53][W][component:238]: Components should block for at most 30 ms.
[01:13:55][D][voice_assistant:627]: Event Type: 11
[01:13:55][D][voice_assistant:781]: Starting STT by VAD
[01:13:56][D][voice_assistant:627]: Event Type: 12
[01:13:56][D][voice_assistant:785]: STT by VAD end
[01:13:56][D][voice_assistant:504]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[01:13:56][D][voice_assistant:510]: Desired state set to AWAITING_RESPONSE
[01:13:56][D][esp_adf.microphone:234]: Stopping microphone
[01:13:56][D][voice_assistant:504]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[01:13:56][D][esp-idf:000][filter]: W (617944) AUDIO_ELEMENT: IN-[filter] AEL_IO_ABORT
[01:13:56]
[01:13:56][D][esp-idf:000][read_task]: E (617946) AUDIO_ELEMENT: [filter] Element already stopped
[01:13:56]
[01:13:56][D][esp-idf:000][read_task]: W (617978) AUDIO_PIPELINE: There are no listener registered
[01:13:56]
[01:13:56][D][esp-idf:000][read_task]: I (617981) AUDIO_PIPELINE: audio_pipeline_unlinked
[01:13:56]
[01:13:56][D][esp-idf:000][read_task]: W (617983) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINATE
[01:13:56]
[01:13:56][D][esp-idf:000][read_task]: I (617985) I2S: DMA queue destroyed
[01:13:56]
[01:13:56][D][esp-idf:000][read_task]: W (617987) AUDIO_ELEMENT: [filter] Element has not create when AUDIO_ELEMENT_TERMINATE
[01:13:56]
[01:13:56][D][esp-idf:000][read_task]: W (617989) AUDIO_ELEMENT: [raw] Element has not create when AUDIO_ELEMENT_TERMINATE
[01:13:56]
[01:13:56][W][component:237]: Component voice_assistant took a long time for an operation (240 ms).
[01:13:56][W][component:238]: Components should block for at most 30 ms.
[01:13:56][D][voice_assistant:627]: Event Type: 4
[01:13:56][D][voice_assistant:655]: Speech recognised as: "How's the weather?"
[01:13:56][D][text_sensor:064]: 'text_request': Sending state 'How's the weather?'
[01:13:57][W][component:237]: Component voice_assistant took a long time for an operation (239 ms).
[01:13:57][W][component:238]: Components should block for at most 30 ms.
[01:13:57][D][voice_assistant:627]: Event Type: 5
[01:13:57][D][voice_assistant:660]: Intent started
[01:13:57][D][esp_adf.microphone:285]: Microphone stopped
[01:13:57][D][voice_assistant:504]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[01:13:58][D][voice_assistant:627]: Event Type: 6
[01:13:58][D][voice_assistant:627]: Event Type: 7
[01:13:58][D][voice_assistant:683]: Response: "The weather is currently clear with a temperature of 77°F and a humidity level of 70%."
[01:13:58][D][text_sensor:064]: 'text_response': Sending state 'The weather is currently clear with a temperature of 77°F and a humidity level of 70%.'
[01:13:58][D][voice_assistant:627]: Event Type: 98
[01:13:58][D][voice_assistant:768]: TTS stream start
[01:13:58][D][esp-idf:000][speaker_task]: I (619320) I2S: DMA Malloc info, datalen=blocksize=2048, dma_buf_count=8
[01:13:58]
[01:13:58][D][esp-idf:000][speaker_task]: I (619324) I2S: I2S0, MCLK output by GPIO2
[01:13:58]
[01:13:58][D][esp-idf:000][speaker_task]: I (619328) AUDIO_PIPELINE: link el->rb, el:0x3d059248, tag:raw, rb:0x3d0593b8
[01:13:58]
[01:13:58][D][esp-idf:000][speaker_task]: I (619330) AUDIO_ELEMENT: [raw-0x3d059248] Element task created
[01:13:58]
[01:13:58][D][esp-idf:000][speaker_task]: I (619333) AUDIO_ELEMENT: [i2s-0x3d058fa4] Element task created
[01:13:58]
[01:13:58]
[01:13:58]
[01:13:58][D][esp-idf:000][i2s]: I (619337) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1
[01:13:58]
[01:13:58][D][esp-idf:000][i2s]: I (619338) I2S_STREAM: AUDIO_STREAM_WRITER
[01:13:58]
[01:13:58][D][esp-idf:000][speaker_task]: I (619339) AUDIO_PIPELINE: Pipeline started
[01:13:58]
[01:13:58][W][component:237]: Component voice_assistant took a long time for an operation (266 ms).
[01:13:58][W][component:238]: Components should block for at most 30 ms.
[01:13:58][D][voice_assistant:627]: Event Type: 8
[01:13:58][D][voice_assistant:703]: Response URL: "http://192.168.86.38:8123/api/tts_proxy/1e3f9e1d2b573e298f093f3a784c3ebcfcff39e6_en-us_f0415ad194_cloud.wav"
[01:13:58][D][voice_assistant:504]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[01:13:58][D][voice_assistant:510]: Desired state set to STREAMING_RESPONSE
[01:13:58][D][voice_assistant:627]: Event Type: 2
[01:13:58][D][voice_assistant:717]: Assist Pipeline ended
[01:14:05][D][voice_assistant:627]: Event Type: 99
[01:14:05][D][voice_assistant:776]: TTS stream end
[01:14:05][D][voice_assistant:375]: End of audio stream received
[01:14:05][D][voice_assistant:504]: State changed from STREAMING_RESPONSE to RESPONSE_FINISHED
[01:14:05][D][voice_assistant:510]: Desired state set to RESPONSE_FINISHED
[01:14:07][D][esp-idf:000][speaker_task]: W (628285) AUDIO_PIPELINE: There are no listener registered
[01:14:07]
[01:14:07][D][esp-idf:000][speaker_task]: I (628287) AUDIO_PIPELINE: audio_pipeline_unlinked
[01:14:07]
[01:14:07][D][esp-idf:000][speaker_task]: W (628289) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINATE
[01:14:07]
[01:14:07][D][esp-idf:000][speaker_task]: I (628293) I2S: DMA queue destroyed
[01:14:07]
[01:14:07][D][esp-idf:000][speaker_task]: W (628297) AUDIO_ELEMENT: [filter] Element has not create when AUDIO_ELEMENT_TERMINATE
[01:14:07]
[01:14:07][D][esp-idf:000][speaker_task]: W (628299) AUDIO_ELEMENT: [raw] Element has not create when AUDIO_ELEMENT_TERMINATE
[01:14:07]
[01:14:07][D][voice_assistant:407]: Speaker has finished outputting all audio
[01:14:07][D][voice_assistant:504]: State changed from RESPONSE_FINISHED to IDLE
[01:14:07][D][voice_assistant:510]: Desired state set to IDLE
[01:14:07][W][component:237]: Component voice_assistant took a long time for an operation (218 ms).
[01:14:07][W][component:238]: Components should block for at most 30 ms.
[01:14:07][D][micro_wake_word:178]: State changed from IDLE to START_MICROPHONE
[01:14:07][D][micro_wake_word:116]: Starting Microphone
[01:14:07][D][micro_wake_word:178]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[01:14:07][D][esp-idf:000][read_task]: I (628529) I2S: DMA Malloc info, datalen=blocksize=512, dma_buf_count=8
[01:14:07]
[01:14:07][D][esp-idf:000][read_task]: I (628533) I2S: I2S0, MCLK output by GPIO2
[01:14:07]
[01:14:07][D][esp-idf:000][read_task]: I (628537) AUDIO_PIPELINE: link el->rb, el:0x3d0593a8, tag:i2s, rb:0x3d0597bc
[01:14:07]
[01:14:07][D][esp-idf:000][read_task]: I (628541) AUDIO_PIPELINE: link el->rb, el:0x3d05951c, tag:filter, rb:0x3d05b7fc
[01:14:07]
[01:14:07][D][esp-idf:000][read_task]: I (628544) AUDIO_ELEMENT: [i2s-0x3d0593a8] Element task created
[01:14:07]
[01:14:07][D][esp-idf:000][read_task]: I (628546) AUDIO_THREAD: The filter task allocate stack on external memory
[01:14:07]
[01:14:07][D][esp-idf:000][read_task]: I (628549) AUDIO_ELEMENT: [filter-0x3d05951c] Element task created
[01:14:07]
[01:14:07][D][esp-idf:000][read_task]: I (628551) AUDIO_ELEMENT: [raw-0x3d05964c] Element task created
[01:14:07]
[01:14:07]
[01:14:07]
[01:14:07][D][esp-idf:000][i2s]: I (628557) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1
[01:14:07]
[01:14:07][D][esp-idf:000][filter]: I (628559) AUDIO_ELEMENT: [filter] AEL_MSG_CMD_RESUME,state:1
[01:14:07]
[01:14:07][D][esp-idf:000][filter]: I (628562) RSP_FILTER: sample rate of source data : 16000, channel of source data : 2, sample rate of destination data : 16000, channel of destination data : 1
[01:14:07]
[01:14:07][D][esp-idf:000][read_task]: I (628565) AUDIO_PIPELINE: Pipeline started
[01:14:07]
[01:14:07][D][esp_adf.microphone:273]: Microphone started
[01:14:07][D][micro_wake_word:178]: State changed from STARTING_MICROPHONE to DETECTING_WAKE_WORD
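
To make the delay easier to see, here is a rough Python sketch that pulls the timing out of a few lines copied from the log above. It is only an illustration of how I read the log, not a diagnosis. The helper names (`parse`, `seconds_between`) are made up for this example. Most event names in `EVENT_NAMES` are confirmed by the message printed right after each "Event Type" line, but the log prints nothing after event type 9, so labeling it "wake word start" is an assumption on my part.

```python
import re
from datetime import datetime

# A few lines copied verbatim from the log above (only the ones needed for timing).
LOG = """\
[01:13:44][D][micro_wake_word:129]: Wake Word Detected
[01:13:44][D][voice_assistant:510]: Desired state set to WAIT_FOR_VAD
[01:13:48][D][voice_assistant:258]: VAD detected speech
[01:13:48][D][voice_assistant:627]: Event Type: 1
[01:13:48][D][voice_assistant:627]: Event Type: 9
[01:13:53][D][voice_assistant:627]: Event Type: 10
[01:13:53][D][voice_assistant:636]: Wake word detected
[01:13:53][D][voice_assistant:627]: Event Type: 3
"""

# 1, 3 and 10 match the messages logged right after them in the log.
# 9 is an ASSUMPTION: the log shows no message for it.
EVENT_NAMES = {
    1: "pipeline run start",
    3: "STT start",
    9: "wake word start (assumed)",
    10: "wake word end",
}

# [HH:MM:SS][level][component:line]: message
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[\w\]\[([\w.\-]+):\d+\]: (.*)$")


def parse(log):
    """Return (timestamp, component, message) tuples for well-formed lines."""
    rows = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%H:%M:%S")
            rows.append((ts, m.group(2), m.group(3)))
    return rows


def seconds_between(rows, first_msg, second_msg):
    """Whole seconds from the first line equal to first_msg to the next
    line equal to second_msg (exact, case-sensitive match)."""
    start = next(ts for ts, _, msg in rows if msg == first_msg)
    end = next(ts for ts, _, msg in rows if msg == second_msg and ts >= start)
    return int((end - start).total_seconds())


rows = parse(LOG)

# micro_wake_word logs "Wake Word Detected"; voice_assistant logs
# "Wake word detected". The different capitalization tells them apart.
print(seconds_between(rows, "Wake Word Detected", "VAD detected speech"))  # 4
print(seconds_between(rows, "Event Type: 9", "Event Type: 10"))            # 5
print(seconds_between(rows, "Wake Word Detected", "Wake word detected"))   # 9

for _, component, msg in rows:
    if msg.startswith("Event Type: "):
        code = int(msg.split(": ")[1])
        print(code, EVENT_NAMES.get(code, "unknown"))
```

If the assumption about event type 9 holds, this run has two separate wake word detections: one on the device at 01:13:44 and a second one reported by voice_assistant at 01:13:53, with STT only starting after the second. That would fit having to say the wake word more than once.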

I am having the exact same issue. It’s super annoying. Though, at least now that I know what the issue is I can just say the wake word twice to reliably get it working.

I don’t know if you have tried it yet, but the firmware has been updated. It still doesn’t work right after a reboot, but if I switch the wake word to run on HA and then switch it back to on-device, it works flawlessly.