Improve finish speaking detection

OptimalHome · May 11, 2024, 4:38am

I’m setting up my first esp32-s3-box-3 and playing around with the voice assistant. If I’m reading the logs right, the device listened for 15 seconds. I’ve tried the default, aggressive, and relaxed settings for finish speaking detection. Nothing seems to help.

Any suggestions?

events:
  - type: run-start
    data:
      pipeline: 01gzhxcf4s2670p1sc24re4v2g
      language: en
    timestamp: "2024-05-11T04:21:30.688537+00:00"
  - type: stt-start
    data:
      engine: stt.home_assistant_cloud
      metadata:
        language: en-US
        format: wav
        codec: pcm
        bit_rate: 16
        sample_rate: 16000
        channel: 1
    timestamp: "2024-05-11T04:21:30.688615+00:00"
  - type: stt-vad-end
    data:
      timestamp: 7500
    timestamp: "2024-05-11T04:21:45.721675+00:00"
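
To put a number on the listening window, the gap between the `stt-start` and `stt-vad-end` events can be computed from their timestamps. A minimal Python sketch, with both timestamps copied from the debug output above (the variable names are only for illustration):

```python
from datetime import datetime

# Timestamps copied from the pipeline debug output above.
stt_start = datetime.fromisoformat("2024-05-11T04:21:30.688615+00:00")
stt_vad_end = datetime.fromisoformat("2024-05-11T04:21:45.721675+00:00")

# Wall-clock time between the start of speech-to-text and the VAD end event.
elapsed = (stt_vad_end - stt_start).total_seconds()
print(f"stt-start to stt-vad-end: {elapsed:.2f} s")
```

That comes out to roughly 15 seconds, which matches the reading of the logs.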

robgough1970 · May 12, 2024, 3:44pm

Watch the ESPHome device logs to see what’s happening on the device. That may give you a better idea of what’s going on.

OptimalHome · May 13, 2024, 4:31am

Thanks, I just collected logs from ESPHome (below). It does look like the microphone is streaming for 15 seconds before it stops:

[21:24:35][D][voice_assistant:439]: State changed from STARTING_MICROPHONE to STREAMING_MICROPHONE
[21:24:50][D][voice_assistant:563]: Event Type: 12
[21:24:50][D][voice_assistant:721]: STT by VAD end
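
The same check can be scripted against the device log. The sketch below is a rough illustration, not part of ESPHome: `LINE_RE` and `streaming_seconds` are hypothetical names, the regex assumes the `[HH:MM:SS][level][component:line]: message` layout seen in these logs, and it ignores runs that cross midnight.

```python
import re
from datetime import datetime

# The three lines quoted above, copied verbatim.
LOG_EXCERPT = """\
[21:24:35][D][voice_assistant:439]: State changed from STARTING_MICROPHONE to STREAMING_MICROPHONE
[21:24:50][D][voice_assistant:563]: Event Type: 12
[21:24:50][D][voice_assistant:721]: STT by VAD end
"""

# Assumed layout: [HH:MM:SS][level][component:line]: message
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[\w\]\[[^\]]+\]: (.*)$")


def streaming_seconds(log_text):
    """Seconds from entering STREAMING_MICROPHONE to 'STT by VAD end', or None."""
    started = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines and anything in another format
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        message = match.group(2)
        if message.endswith("to STREAMING_MICROPHONE"):
            started = stamp
        elif message == "STT by VAD end" and started is not None:
            return (stamp - started).total_seconds()
    return None


print(streaming_seconds(LOG_EXCERPT))
```

The log timestamps only have one-second resolution, so the result is accurate to about a second.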

Full logs:

[21:24:34][D][micro_wake_word:362]: Wake word sliding average probability is 0.556 and most recent probability is 0.910
[21:24:34][D][micro_wake_word:128]: Wake Word Detected
[21:24:34][D][micro_wake_word:177]: State changed from DETECTING_WAKE_WORD to STOP_MICROPHONE
[21:24:34][D][micro_wake_word:134]: Stopping Microphone
[21:24:34][D][esp_adf.microphone:234]: Stopping microphone
[21:24:34][D][micro_wake_word:177]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[21:24:34][D][esp-idf:000]: W (226440) AUDIO_ELEMENT: IN-[filter] AEL_IO_ABORT

[21:24:34][D][esp-idf:000]: E (226444) AUDIO_ELEMENT: [filter] Element already stopped

[21:24:34][D][esp-idf:000]: W (226474) AUDIO_PIPELINE: There are no listener registered

[21:24:34][D][esp-idf:000]: I (226479) AUDIO_PIPELINE: audio_pipeline_unlinked

[21:24:34][D][esp-idf:000]: W (226482) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINATE

[21:24:34][D][esp-idf:000]: I (226486) I2S: DMA queue destroyed

[21:24:34][D][esp-idf:000]: W (226493) AUDIO_ELEMENT: [filter] Element has not create when AUDIO_ELEMENT_TERMINATE

[21:24:34][D][esp-idf:000]: W (226497) AUDIO_ELEMENT: [raw] Element has not create when AUDIO_ELEMENT_TERMINATE

[21:24:34][D][esp_adf.microphone:285]: Microphone stopped
[21:24:34][D][micro_wake_word:177]: State changed from STOPPING_MICROPHONE to IDLE
[21:24:34][D][voice_assistant:439]: State changed from IDLE to START_PIPELINE
[21:24:34][D][voice_assistant:445]: Desired state set to START_MICROPHONE
[21:24:34][D][voice_assistant:126]: microphone not running
[21:24:34][D][voice_assistant:210]: Requesting start...
[21:24:34][D][voice_assistant:439]: State changed from START_PIPELINE to STARTING_PIPELINE
[21:24:34][D][voice_assistant:126]: microphone not running
[21:24:34][D][voice_assistant:126]: microphone not running
[21:24:34][D][voice_assistant:126]: microphone not running
[21:24:34][D][voice_assistant:126]: microphone not running
[21:24:34][D][voice_assistant:126]: microphone not running
[21:24:34][D][voice_assistant:126]: microphone not running
[21:24:34][D][voice_assistant:126]: microphone not running
[21:24:34][D][voice_assistant:126]: microphone not running
[21:24:34][D][voice_assistant:126]: microphone not running
[21:24:34][D][voice_assistant:460]: Client started, streaming microphone
[21:24:34][D][voice_assistant:439]: State changed from STARTING_PIPELINE to START_MICROPHONE
[21:24:34][D][voice_assistant:445]: Desired state set to STREAMING_MICROPHONE
[21:24:34][D][voice_assistant:163]: Starting Microphone
[21:24:34][D][voice_assistant:439]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[21:24:34][D][voice_assistant:563]: Event Type: 1
[21:24:34][D][voice_assistant:566]: Assist Pipeline running
[21:24:34][D][esp-idf:000]: I (226638) I2S: DMA Malloc info, datalen=blocksize=512, dma_buf_count=8

[21:24:34][D][esp-idf:000]: I (226648) I2S: I2S0, MCLK output by GPIO2

[21:24:34][D][esp-idf:000]: I (226656) AUDIO_PIPELINE: link el->rb, el:0x3d05c4ac, tag:i2s, rb:0x3d05c8c0

[21:24:34][D][esp-idf:000]: I (226665) AUDIO_PIPELINE: link el->rb, el:0x3d05c620, tag:filter, rb:0x3d05e900

[21:24:34][D][esp-idf:000]: I (226671) AUDIO_ELEMENT: [i2s-0x3d05c4ac] Element task created

[21:24:34][D][esp-idf:000]: I (226675) AUDIO_THREAD: The filter task allocate stack on external memory

[21:24:34][D][esp-idf:000]: I (226680) AUDIO_ELEMENT: [filter-0x3d05c620] Element task created

[21:24:34][D][esp-idf:000]: I (226687) AUDIO_ELEMENT: [raw-0x3d05c750] Element task created

[21:24:34][D][esp-idf:000]: I (226693) AUDIO_PIPELINE: Func:audio_pipeline_run, Line:359, MEM Total:16463891 Bytes, Inter:82244 Bytes, Dram:82244 Bytes


[21:24:34][D][esp-idf:000]: I (226700) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1

[21:24:34][D][esp-idf:000]: I (226704) AUDIO_ELEMENT: [filter] AEL_MSG_CMD_RESUME,state:1

[21:24:34][D][esp-idf:000]: I (226709) RSP_FILTER: sample rate of source data : 16000, channel of source data : 2, sample rate of destination data : 16000, channel of destination data : 1

[21:24:34][D][esp-idf:000]: I (226716) AUDIO_PIPELINE: Pipeline started

[21:24:35][W][component:237]: Component voice_assistant took a long time for an operation (247 ms).
[21:24:35][W][component:238]: Components should block for at most 30 ms.
[21:24:35][D][esp_adf.microphone:273]: Microphone started
[21:24:35][D][voice_assistant:439]: State changed from STARTING_MICROPHONE to STREAMING_MICROPHONE
[21:24:50][D][voice_assistant:563]: Event Type: 12
[21:24:50][D][voice_assistant:721]: STT by VAD end
[21:24:50][D][voice_assistant:439]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[21:24:50][D][voice_assistant:445]: Desired state set to AWAITING_RESPONSE
[21:24:50][D][esp_adf.microphone:234]: Stopping microphone
[21:24:50][D][voice_assistant:439]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[21:24:50][D][esp-idf:000]: E (242363) AUDIO_ELEMENT: [filter] Element already stopped

[21:24:50][D][esp-idf:000]: W (242392) AUDIO_PIPELINE: There are no listener registered

[21:24:50][D][esp-idf:000]: I (242396) AUDIO_PIPELINE: audio_pipeline_unlinked

[21:24:50][D][esp-idf:000]: W (242401) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINATE

[21:24:50][D][esp-idf:000]: I (242409) I2S: DMA queue destroyed

[21:24:50][D][esp-idf:000]: W (242415) AUDIO_ELEMENT: [filter] Element has not create when AUDIO_ELEMENT_TERMINATE

[21:24:50][D][esp-idf:000]: W (242420) AUDIO_ELEMENT: [raw] Element has not create when AUDIO_ELEMENT_TERMINATE

[21:24:50][W][component:237]: Component voice_assistant took a long time for an operation (230 ms).
[21:24:50][W][component:238]: Components should block for at most 30 ms.
[21:24:50][D][esp_adf.microphone:285]: Microphone stopped
[21:24:50][D][voice_assistant:439]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[21:24:57][D][voice_assistant:563]: Event Type: 4
[21:24:57][D][voice_assistant:591]: Speech recognised as: " Turn on the master bedroom lights."
[21:24:57][D][text_sensor:064]: 'text_request': Sending state ' Turn on the master bedroom lights.'
[21:24:57][W][component:237]: Component voice_assistant took a long time for an operation (223 ms).
[21:24:57][W][component:238]: Components should block for at most 30 ms.
[21:24:57][D][voice_assistant:563]: Event Type: 5
[21:24:57][D][voice_assistant:596]: Intent started
[21:24:57][D][voice_assistant:563]: Event Type: 6
[21:24:57][D][voice_assistant:563]: Event Type: 7
[21:24:57][D][voice_assistant:619]: Response: "Turned on the light"
[21:24:57][D][text_sensor:064]: 'text_response': Sending state 'Turned on the light'
[21:24:57][D][voice_assistant:563]: Event Type: 98
[21:24:57][D][voice_assistant:704]: TTS stream start
[21:24:57][D][esp-idf:000]: I (249651) I2S: DMA Malloc info, datalen=blocksize=2048, dma_buf_count=8

[21:24:57][D][esp-idf:000]: I (249659) I2S: I2S0, MCLK output by GPIO2

[21:24:57][D][esp-idf:000]: I (249664) AUDIO_PIPELINE: link el->rb, el:0x3d05c34c, tag:raw, rb:0x3d05c4bc

[21:24:57][D][esp-idf:000]: I (249670) AUDIO_ELEMENT: [raw-0x3d05c34c] Element task created

[21:24:57][D][esp-idf:000]: I (249678) AUDIO_ELEMENT: [i2s-0x3d05c0a8] Element task created

[21:24:57][D][esp-idf:000]: I (249682) AUDIO_PIPELINE: Func:audio_pipeline_run, Line:359, MEM Total:16463679 Bytes, Inter:74792 Bytes, Dram:74792 Bytes


[21:24:57][D][esp-idf:000]: I (249688) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1

[21:24:57][D][esp-idf:000]: I (249691) I2S_STREAM: AUDIO_STREAM_WRITER

[21:24:58][W][component:237]: Component voice_assistant took a long time for an operation (244 ms).
[21:24:58][W][component:238]: Components should block for at most 30 ms.
[21:24:58][D][voice_assistant:563]: Event Type: 8
[21:24:58][D][voice_assistant:639]: Response URL: "http://192.168.50.135:8123/api/tts_proxy/104c89b5f9053e4751d03002aab527c96124bd77_en-us_4d30e09a66_tts.piper.wav"
[21:24:58][D][voice_assistant:439]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[21:24:58][D][voice_assistant:445]: Desired state set to STREAMING_RESPONSE
[21:24:58][D][voice_assistant:563]: Event Type: 2
[21:24:58][D][voice_assistant:653]: Assist Pipeline ended
[21:24:59][D][voice_assistant:563]: Event Type: 99
[21:24:59][D][voice_assistant:712]: TTS stream end
[21:24:59][D][voice_assistant:310]: End of audio stream received
[21:24:59][D][voice_assistant:439]: State changed from STREAMING_RESPONSE to RESPONSE_FINISHED
[21:24:59][D][voice_assistant:445]: Desired state set to RESPONSE_FINISHED
[21:25:00][D][esp-idf:000]: W (252732) AUDIO_PIPELINE: There are no listener registered

[21:25:00][D][esp-idf:000]: I (252738) AUDIO_PIPELINE: audio_pipeline_unlinked

[21:25:00][D][esp-idf:000]: W (252743) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINATE

[21:25:00][D][esp-idf:000]: I (252750) I2S: DMA queue destroyed

[21:25:00][D][esp-idf:000]: W (252757) AUDIO_ELEMENT: [filter] Element has not create when AUDIO_ELEMENT_TERMINATE

[21:25:00][D][esp-idf:000]: W (252766) AUDIO_ELEMENT: [raw] Element has not create when AUDIO_ELEMENT_TERMINATE

[21:25:00][D][voice_assistant:342]: Speaker has finished outputting all audio
[21:25:01][D][voice_assistant:439]: State changed from RESPONSE_FINISHED to IDLE
[21:25:01][D][voice_assistant:445]: Desired state set to IDLE
[21:25:01][W][component:237]: Component voice_assistant took a long time for an operation (222 ms).
[21:25:01][W][component:238]: Components should block for at most 30 ms.
[21:25:01][D][micro_wake_word:177]: State changed from IDLE to START_MICROPHONE
[21:25:01][D][micro_wake_word:115]: Starting Microphone
[21:25:01][D][micro_wake_word:177]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[21:25:01][D][esp-idf:000]: I (253011) I2S: DMA Malloc info, datalen=blocksize=512, dma_buf_count=8

[21:25:01][D][esp-idf:000]: I (253019) I2S: I2S0, MCLK output by GPIO2

[21:25:01][D][esp-idf:000]: I (253026) AUDIO_PIPELINE: link el->rb, el:0x3d05c4ac, tag:i2s, rb:0x3d05c8c0

[21:25:01][D][esp-idf:000]: I (253034) AUDIO_PIPELINE: link el->rb, el:0x3d05c620, tag:filter, rb:0x3d05e900

[21:25:01][D][esp-idf:000]: I (253044) AUDIO_ELEMENT: [i2s-0x3d05c4ac] Element task created

[21:25:01][D][esp-idf:000]: I (253051) AUDIO_THREAD: The filter task allocate stack on external memory

[21:25:01][D][esp-idf:000]: I (253058) AUDIO_ELEMENT: [filter-0x3d05c620] Element task created

[21:25:01][D][esp-idf:000]: I (253064) AUDIO_ELEMENT: [raw-0x3d05c750] Element task created

[21:25:01][D][esp-idf:000]: I (253070) AUDIO_PIPELINE: Func:audio_pipeline_run, Line:359, MEM Total:16469395 Bytes, Inter:87748 Bytes, Dram:87748 Bytes


[21:25:01][D][esp-idf:000]: I (253074) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1

[21:25:01][D][esp-idf:000]: I (253079) AUDIO_ELEMENT: [filter] AEL_MSG_CMD_RESUME,state:1

[21:25:01][D][esp-idf:000]: I (253085) RSP_FILTER: sample rate of source data : 16000, channel of source data : 2, sample rate of destination data : 16000, channel of destination data : 1

[21:25:01][D][esp-idf:000]: I (253094) AUDIO_PIPELINE: Pipeline started

[21:25:01][D][esp_adf.microphone:273]: Microphone started
[21:25:01][D][micro_wake_word:177]: State changed from STARTING_MICROPHONE to DETECTING_WAKE_WORD
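
For a broader view of where the time goes, the `voice_assistant` state transitions in the full log can be turned into per-state durations. This is a sketch under the same assumptions about the log layout; `STATE_RE` and `state_durations` are hypothetical helper names, and only the transitions from STREAMING_MICROPHONE onward are copied in to keep it short.

```python
import re
from datetime import datetime

# voice_assistant state transitions copied from the full log above.
TRANSITIONS = """\
[21:24:35][D][voice_assistant:439]: State changed from STARTING_MICROPHONE to STREAMING_MICROPHONE
[21:24:50][D][voice_assistant:439]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[21:24:50][D][voice_assistant:439]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[21:24:50][D][voice_assistant:439]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[21:24:58][D][voice_assistant:439]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[21:24:59][D][voice_assistant:439]: State changed from STREAMING_RESPONSE to RESPONSE_FINISHED
[21:25:01][D][voice_assistant:439]: State changed from RESPONSE_FINISHED to IDLE
"""

STATE_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\]\[D\]\[voice_assistant:\d+\]: "
    r"State changed from (\w+) to (\w+)$"
)


def state_durations(log_text):
    """Return (state, seconds) pairs for each state that was entered and left."""
    durations = []
    entered = {}  # state name -> time it was entered
    for line in log_text.splitlines():
        match = STATE_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        old_state, new_state = match.group(2), match.group(3)
        if old_state in entered:
            seconds = (stamp - entered.pop(old_state)).total_seconds()
            durations.append((old_state, seconds))
        entered[new_state] = stamp
    return durations


for state, seconds in state_durations(TRANSITIONS):
    print(f"{state:22s} {seconds:4.0f} s")
```

On this run the two long waits are STREAMING_MICROPHONE (about 15 s) and AWAITING_RESPONSE (about 8 s); the other states pass in two seconds or less.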