After setting up an Assist pipeline and installing the latest voice assistant on an ESP32 S3 Box 3, the local wake word works and opens a microphone stream, but Whisper never responds. Home Assistant shows Assist as working, and I can see Whisper being triggered in Assist debug, but no text comes back and the run seems to just hang.
The ESP32 screen also just shows the white logo (not the intent logo) and hangs until I mute the device from Home Assistant.
I tried switching to wake word detection in the Assist pipeline via openWakeWord, but that doesn’t work either (no logs and no wake word detected).
The S3 Box 3 shows as "assist in progress" in HA even when the wake word is on-device and the device is in the detecting-wake-word state, which seems incorrect to me.
Here are the logs for the S3 Box:
[D][micro_wake_word:170]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[D][esp-idf:000]: I (941588) I2S: DMA Malloc info, datalen=blocksize=512, dma_buf_count=8
[D][esp-idf:000]: I (941592) I2S: I2S0, MCLK output by GPIO2
[D][esp-idf:000]: I (941596) AUDIO_PIPELINE: link el->rb, el:0x3d05c5f0, tag:i2s, rb:0x3d05ca04
[D][esp-idf:000]: I (941598) AUDIO_PIPELINE: link el->rb, el:0x3d05c764, tag:filter, rb:0x3d05ea44
[D][esp-idf:000]: I (941603) AUDIO_ELEMENT: [i2s-0x3d05c5f0] Element task created
[D][esp-idf:000]: I (941605) AUDIO_THREAD: The filter task allocate stack on external memory
[D][esp-idf:000]: I (941608) AUDIO_ELEMENT: [filter-0x3d05c764] Element task created
[D][esp-idf:000]: I (941610) AUDIO_ELEMENT: [raw-0x3d05c894] Element task created
[D][esp-idf:000]: I (941614) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1
[D][esp-idf:000]: I (941617) AUDIO_ELEMENT: [filter] AEL_MSG_CMD_RESUME,state:1
[D][esp-idf:000]: I (941620) RSP_FILTER: sample rate of source data : 16000, channel of source data : 2, sample rate of destination data : 16000, channel of destination data : 1
[D][esp-idf:000]: I (941624) AUDIO_PIPELINE: Pipeline started
[D][esp_adf.microphone:273]: Microphone started
[D][micro_wake_word:170]: State changed from STARTING_MICROPHONE to DETECTING_WAKE_WORD
[D][esp32.preferences:114]: Saving 1 preferences to flash...
[D][esp32.preferences:143]: Saving 1 preferences to flash: 1 cached, 0 written, 0 failed
[D][micro_wake_word:121]: Wake Word Detected
[D][micro_wake_word:170]: State changed from DETECTING_WAKE_WORD to STOP_MICROPHONE
[D][micro_wake_word:127]: Stopping Microphone
[D][esp_adf.microphone:234]: Stopping microphone
[D][micro_wake_word:170]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[D][esp-idf:000]: W (1094412) AUDIO_PIPELINE: There are no listener registered
[D][esp-idf:000]: I (1094414) AUDIO_PIPELINE: audio_pipeline_unlinked
[D][esp-idf:000]: W (1094414) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINATE
[D][esp-idf:000]: I (1094416) I2S: DMA queue destroyed
[D][esp-idf:000]: W (1094418) AUDIO_ELEMENT: [filter] Element has not create when AUDIO_ELEMENT_TERMINATE
[D][esp-idf:000]: W (1094420) AUDIO_ELEMENT: [raw] Element has not create when AUDIO_ELEMENT_TERMINATE
[D][esp_adf.microphone:285]: Microphone stopped
[D][micro_wake_word:170]: State changed from STOPPING_MICROPHONE to IDLE
[D][voice_assistant:416]: State changed from IDLE to START_PIPELINE
[D][voice_assistant:422]: Desired state set to START_MICROPHONE
[D][voice_assistant:118]: microphone not running
[D][voice_assistant:202]: Requesting start...
[D][voice_assistant:416]: State changed from START_PIPELINE to STARTING_PIPELINE
[D][voice_assistant:437]: Client started, streaming microphone
[D][voice_assistant:416]: State changed from STARTING_PIPELINE to START_MICROPHONE
[D][voice_assistant:422]: Desired state set to STREAMING_MICROPHONE
[D][voice_assistant:155]: Starting Microphone
[D][voice_assistant:416]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[D][voice_assistant:523]: Event Type: 1
[D][voice_assistant:526]: Assist Pipeline running
[D][voice_assistant:523]: Event Type: 3
[D][voice_assistant:537]: STT started
[D][esp-idf:000]: I (1094475) AUDIO_PIPELINE: link el->rb, el:0x3d05c5f0, tag:i2s, rb:0x3d05ca04
[D][esp-idf:000]: I (1094477) AUDIO_PIPELINE: link el->rb, el:0x3d05c764, tag:filter, rb:0x3d05ea44
[D][esp-idf:000]: I (1094481) AUDIO_ELEMENT: [i2s-0x3d05c5f0] Element task created
[D][esp-idf:000]: I (1094481) AUDIO_THREAD: The filter task allocate stack on external memory
[D][esp-idf:000]: I (1094484) AUDIO_ELEMENT: [filter-0x3d05c764] Element task created
[D][esp-idf:000]: I (1094484) AUDIO_ELEMENT: [raw-0x3d05c894] Element task created
[D][esp-idf:000]: I (1094488) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1
[D][esp-idf:000]: I (1094490) AUDIO_ELEMENT: [filter] AEL_MSG_CMD_RESUME,state:1
[D][esp-idf:000]: I (1094493) RSP_FILTER: sample rate of source data : 16000, channel of source data : 2, sample rate of destination data : 16000, channel of destination data : 1
[D][esp-idf:000]: I (1094495) AUDIO_PIPELINE: Pipeline started
[W][component:214]: Component voice_assistant took a long time for an operation (0.22 s).
[W][component:215]: Components should block for at most 20-30ms.
[D][esp_adf.microphone:273]: Microphone started
[D][voice_assistant:416]: State changed from STARTING_MICROPHONE to STREAMING_MICROPHONE
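To make the log above easier to read, here is a rough sketch that pulls only the "State changed from X to Y" lines out of an ESPHome log and reports the last state each component reached. The function names are my own and not part of ESPHome; the regex just assumes the log line format shown above.

```python
import re

# Matches lines such as:
# [D][voice_assistant:416]: State changed from IDLE to START_PIPELINE
STATE_RE = re.compile(
    r"\[\w\]\[(?P<component>[\w.]+):\d+\]: "
    r"State changed from (?P<old>\w+) to (?P<new>\w+)"
)


def state_transitions(log_text):
    """Return (component, old_state, new_state) tuples in log order."""
    return [
        (m["component"], m["old"], m["new"])
        for m in STATE_RE.finditer(log_text)
    ]


def final_states(log_text):
    """Return the last state seen for each component."""
    last = {}
    for component, _old, new in state_transitions(log_text):
        last[component] = new
    return last


# A few lines copied from the log above.
sample = """\
[D][micro_wake_word:170]: State changed from STOPPING_MICROPHONE to IDLE
[D][voice_assistant:416]: State changed from IDLE to START_PIPELINE
[D][voice_assistant:523]: Event Type: 3
[D][voice_assistant:416]: State changed from STARTING_MICROPHONE to STREAMING_MICROPHONE
"""

print(final_states(sample))
# {'micro_wake_word': 'IDLE', 'voice_assistant': 'STREAMING_MICROPHONE'}
```

Run against the full log, this shows the voice assistant ending in STREAMING_MICROPHONE with no later transition, which matches the hang I am seeing.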
There seem to be no logs in Whisper until I mute the box, which I assume kills the microphone stream.
This is the debug info from Assist debug for the run started from the S3 Box:
stage: stt
run:
  pipeline: 01h3aqwm2apt1dftgvbzyfb4sw
  language: en
events:
  - type: run-start
    data:
      pipeline: 01h3aqwm2apt1dftgvbzyfb4sw
      language: en
    timestamp: "2024-03-19T19:25:40.454130+00:00"
  - type: stt-start
    data:
      engine: stt.faster_whisper
      metadata:
        language: en
        format: wav
        codec: pcm
        bit_rate: 16
        sample_rate: 16000
        channel: 1
    timestamp: "2024-03-19T19:25:40.454451+00:00"
stt:
  engine: stt.faster_whisper
  metadata:
    language: en
    format: wav
    codec: pcm
    bit_rate: 16
    sample_rate: 16000
    channel: 1
  done: false
I thought it might just be slow to process, so I left it for a few minutes, but nothing happened.
My Assist pipeline is pretty basic, using faster-whisper with the tiny model. I’ve tried different models and it makes no difference. Piper is standard and works fine when testing in Assist debug.
To fully verify the Assist pipeline, I also used a microphone on my PC in debug mode, which works perfectly fine:
init_options:
  start_stage: wake_word
  end_stage: tts
  input:
    sample_rate: 44100
  pipeline: 01h3aqwm2apt1dftgvbzyfb4sw
  conversation_id: null
stage: done
run:
  pipeline: 01h3aqwm2apt1dftgvbzyfb4sw
  language: en
  runner_data:
    stt_binary_handler_id: 4
    timeout: 300
events:
  - type: run-start
    data:
      pipeline: 01h3aqwm2apt1dftgvbzyfb4sw
      language: en
      runner_data:
        stt_binary_handler_id: 4
        timeout: 300
    timestamp: "2024-03-19T19:30:39.246644+00:00"
  - type: wake_word-start
    data:
      entity_id: wake_word.openwakeword
      metadata:
        format: wav
        codec: pcm
        bit_rate: 16
        sample_rate: 16000
        channel: 1
      timeout: 3
    timestamp: "2024-03-19T19:30:39.246888+00:00"
  - type: wake_word-end
    data:
      wake_word_output:
        wake_word_id: ok_nabu_v0.1
        wake_word_phrase: ok nabu
        timestamp: 1990
    timestamp: "2024-03-19T19:30:43.479709+00:00"
  - type: stt-start
    data:
      engine: stt.faster_whisper
      metadata:
        language: en
        format: wav
        codec: pcm
        bit_rate: 16
        sample_rate: 16000
        channel: 1
    timestamp: "2024-03-19T19:30:43.479975+00:00"
  - type: stt-vad-start
    data:
      timestamp: 2490
    timestamp: "2024-03-19T19:30:44.445430+00:00"
  - type: stt-vad-end
    data:
      timestamp: 3340
    timestamp: "2024-03-19T19:30:46.149604+00:00"
  - type: stt-end
    data:
      stt_output:
        text: " Turn on Bedroom Light."
    timestamp: "2024-03-19T19:30:46.694521+00:00"
  - type: intent-start
    data:
      engine: homeassistant
      language: en
      intent_input: " Turn on Bedroom Light."
      conversation_id: null
      device_id: null
    timestamp: "2024-03-19T19:30:46.694586+00:00"
  - type: intent-end
    data:
      intent_output:
        response:
          speech:
            plain:
              speech: Turned on the light
              extra_data: null
          card: {}
          language: en
          response_type: action_done
          data:
            targets: []
            success:
              - name: Bedroom Light
                type: entity
                id: light.bedroom_light
            failed: []
        conversation_id: null
    timestamp: "2024-03-19T19:30:46.719617+00:00"
  - type: tts-start
    data:
      engine: tts.piper
      language: en_GB
      voice: en_GB-vctk-medium
      tts_input: Turned on the light
    timestamp: "2024-03-19T19:30:46.719676+00:00"
  - type: tts-end
    data:
      tts_output:
        media_id: >-
          media-source://tts/tts.piper?message=Turned+on+the+light&language=en_GB&voice=en_GB-vctk-medium
        url: >-
          /api/tts_proxy/104c89b5f9053e4751d03002aab527c96124bd77_en-gb_d3b473ba1f_tts.piper.mp3
        mime_type: audio/mpeg
    timestamp: "2024-03-19T19:30:46.719913+00:00"
  - type: run-end
    data: null
    timestamp: "2024-03-19T19:30:46.719940+00:00"
wake_word:
  entity_id: wake_word.openwakeword
  metadata:
    format: wav
    codec: pcm
    bit_rate: 16
    sample_rate: 16000
    channel: 1
  timeout: 3
  done: true
  wake_word_output:
    wake_word_id: ok_nabu_v0.1
    wake_word_phrase: ok nabu
    timestamp: 1990
stt:
  engine: stt.faster_whisper
  metadata:
    language: en
    format: wav
    codec: pcm
    bit_rate: 16
    sample_rate: 16000
    channel: 1
  done: true
  stt_output:
    text: " Turn on Bedroom Light."
intent:
  engine: homeassistant
  language: en
  intent_input: " Turn on Bedroom Light."
  conversation_id: null
  device_id: null
  done: true
  intent_output:
    response:
      speech:
        plain:
          speech: Turned on the light
          extra_data: null
      card: {}
      language: en
      response_type: action_done
      data:
        targets: []
        success:
          - name: Bedroom Light
            type: entity
            id: light.bedroom_light
        failed: []
    conversation_id: null
tts:
  engine: tts.piper
  language: en_GB
  voice: en_GB-vctk-medium
  tts_input: Turned on the light
  done: true
  tts_output:
    media_id: >-
      media-source://tts/tts.piper?message=Turned+on+the+light&language=en_GB&voice=en_GB-vctk-medium
    url: >-
      /api/tts_proxy/104c89b5f9053e4751d03002aab527c96124bd77_en-gb_d3b473ba1f_tts.piper.mp3
    mime_type: audio/mpeg
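To make the difference between the two runs explicit, here is a minimal sketch that lists which event types follow stt-start in the working PC run but never show up in the run from the S3 Box. The two event lists are copied by hand from the debug dumps above, and missing_after is my own helper name, not anything from Home Assistant.

```python
def missing_after(stuck_types, working_types, anchor="stt-start"):
    """Event types that follow `anchor` in the working run
    but never appear in the stuck run."""
    if anchor not in working_types:
        return []
    tail = working_types[working_types.index(anchor) + 1:]
    seen = set(stuck_types)
    return [t for t in tail if t not in seen]


# Event types from the run started on the S3 Box (hangs at stt).
stuck = ["run-start", "stt-start"]

# Event types from the run started with the PC microphone (completes).
working = [
    "run-start", "wake_word-start", "wake_word-end",
    "stt-start", "stt-vad-start", "stt-vad-end", "stt-end",
    "intent-start", "intent-end", "tts-start", "tts-end", "run-end",
]

print(missing_after(stuck, working)[0])
# stt-vad-start
```

The first missing event is stt-vad-start, so in the S3 Box run the pipeline apparently never gets as far as detecting the start of speech. That would fit with either no audio arriving or the audio not being recognised as speech, but I can't tell which from the debug output alone.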
So the problem must be somewhere between the voice assistant on the S3 Box and the Assist pipeline?
Where should I look for more debugging information, and has anyone come across this before?
I also have some other questions I’ve been trying to figure out that might help:
- How does the pipeline know when to stop processing audio, and when should the audio clip be trimmed?
- What sits between the audio being streamed from the S3 Box and Whisper, i.e. what is Home Assistant doing with it?
Any help would be greatly appreciated, as it’s been driving me nuts trying to figure out what it could be.
Some version details:
HA: 2023.3.1
Whisper: wyoming - 1.5.3, faster-whisper - 1.0.1 via rhasspy/wyoming-whisper:latest
ESP32 s3 Box 3: esphome.voice-assistant version 2.0 / ESPHome version 2024.2.2