S3 Box 3 does not trigger Whisper STT / streaming not working

After setting up an Assist pipeline and installing the latest voice assistant firmware on an S3 Box 3, local wake word detection works and opens a microphone stream, but there is no response in Whisper. Home Assistant shows Assist as working, and I can see Whisper being triggered in the Assist debug view, but no text comes back and it seems to just hang.

The ESP32 screen also just shows the white logo (not the intent logo) and hangs until I mute it from Home Assistant.

I tried switching to wake word detection in the Assist pipeline via openWakeWord, but that doesn't work either (no logs, no wake word detected).

The S3 Box 3 shows as "Assist in progress" in HA even when wake word detection is on-device and the device is in the detecting-wake-word state, which seems incorrect to me.


Here are the logs from the S3 Box:

[D][micro_wake_word:170]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[D][esp-idf:000]: I (941588) I2S: DMA Malloc info, datalen=blocksize=512, dma_buf_count=8
[D][esp-idf:000]: I (941592) I2S: I2S0, MCLK output by GPIO2
[D][esp-idf:000]: I (941596) AUDIO_PIPELINE: link el->rb, el:0x3d05c5f0, tag:i2s, rb:0x3d05ca04
[D][esp-idf:000]: I (941598) AUDIO_PIPELINE: link el->rb, el:0x3d05c764, tag:filter, rb:0x3d05ea44
[D][esp-idf:000]: I (941603) AUDIO_ELEMENT: [i2s-0x3d05c5f0] Element task created
[D][esp-idf:000]: I (941605) AUDIO_THREAD: The filter task allocate stack on external memory
[D][esp-idf:000]: I (941608) AUDIO_ELEMENT: [filter-0x3d05c764] Element task created
[D][esp-idf:000]: I (941610) AUDIO_ELEMENT: [raw-0x3d05c894] Element task created
[D][esp-idf:000]: I (941614) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1
[D][esp-idf:000]: I (941617) AUDIO_ELEMENT: [filter] AEL_MSG_CMD_RESUME,state:1
[D][esp-idf:000]: I (941620) RSP_FILTER: sample rate of source data : 16000, channel of source data : 2, sample rate of destination data : 16000, channel of destination data : 1
[D][esp-idf:000]: I (941624) AUDIO_PIPELINE: Pipeline started
[D][esp_adf.microphone:273]: Microphone started
[D][micro_wake_word:170]: State changed from STARTING_MICROPHONE to DETECTING_WAKE_WORD
[D][esp32.preferences:114]: Saving 1 preferences to flash...
[D][esp32.preferences:143]: Saving 1 preferences to flash: 1 cached, 0 written, 0 failed
[D][micro_wake_word:121]: Wake Word Detected
[D][micro_wake_word:170]: State changed from DETECTING_WAKE_WORD to STOP_MICROPHONE
[D][micro_wake_word:127]: Stopping Microphone
[D][esp_adf.microphone:234]: Stopping microphone
[D][micro_wake_word:170]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[D][esp-idf:000]: W (1094412) AUDIO_PIPELINE: There are no listener registered
[D][esp-idf:000]: I (1094414) AUDIO_PIPELINE: audio_pipeline_unlinked
[D][esp-idf:000]: W (1094414) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINATE
[D][esp-idf:000]: I (1094416) I2S: DMA queue destroyed
[D][esp-idf:000]: W (1094418) AUDIO_ELEMENT: [filter] Element has not create when AUDIO_ELEMENT_TERMINATE
[D][esp-idf:000]: W (1094420) AUDIO_ELEMENT: [raw] Element has not create when AUDIO_ELEMENT_TERMINATE
[D][esp_adf.microphone:285]: Microphone stopped
[D][micro_wake_word:170]: State changed from STOPPING_MICROPHONE to IDLE
[D][voice_assistant:416]: State changed from IDLE to START_PIPELINE
[D][voice_assistant:422]: Desired state set to START_MICROPHONE
[D][voice_assistant:118]: microphone not running
[D][voice_assistant:202]: Requesting start...
[D][voice_assistant:416]: State changed from START_PIPELINE to STARTING_PIPELINE
[D][voice_assistant:437]: Client started, streaming microphone
[D][voice_assistant:416]: State changed from STARTING_PIPELINE to START_MICROPHONE
[D][voice_assistant:422]: Desired state set to STREAMING_MICROPHONE
[D][voice_assistant:155]: Starting Microphone
[D][voice_assistant:416]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[D][voice_assistant:523]: Event Type: 1
[D][voice_assistant:526]: Assist Pipeline running
[D][voice_assistant:523]: Event Type: 3
[D][voice_assistant:537]: STT started
[D][esp-idf:000]: I (1094475) AUDIO_PIPELINE: link el->rb, el:0x3d05c5f0, tag:i2s, rb:0x3d05ca04
[D][esp-idf:000]: I (1094477) AUDIO_PIPELINE: link el->rb, el:0x3d05c764, tag:filter, rb:0x3d05ea44
[D][esp-idf:000]: I (1094481) AUDIO_ELEMENT: [i2s-0x3d05c5f0] Element task created
[D][esp-idf:000]: I (1094481) AUDIO_THREAD: The filter task allocate stack on external memory
[D][esp-idf:000]: I (1094484) AUDIO_ELEMENT: [filter-0x3d05c764] Element task created
[D][esp-idf:000]: I (1094484) AUDIO_ELEMENT: [raw-0x3d05c894] Element task created
[D][esp-idf:000]: I (1094488) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1
[D][esp-idf:000]: I (1094490) AUDIO_ELEMENT: [filter] AEL_MSG_CMD_RESUME,state:1
[D][esp-idf:000]: I (1094493) RSP_FILTER: sample rate of source data : 16000, channel of source data : 2, sample rate of destination data : 16000, channel of destination data : 1
[D][esp-idf:000]: I (1094495) AUDIO_PIPELINE: Pipeline started
[W][component:214]: Component voice_assistant took a long time for an operation (0.22 s).
[W][component:215]: Components should block for at most 20-30ms.
[D][esp_adf.microphone:273]: Microphone started
[D][voice_assistant:416]: State changed from STARTING_MICROPHONE to STREAMING_MICROPHONE

There are no logs in Whisper until I mute the box, which I assume kills the microphone stream.

The debug info from Assist debug:

stage: stt
run:
  pipeline: 01h3aqwm2apt1dftgvbzyfb4sw
  language: en
events:
  - type: run-start
    data:
      pipeline: 01h3aqwm2apt1dftgvbzyfb4sw
      language: en
    timestamp: "2024-03-19T19:25:40.454130+00:00"
  - type: stt-start
    data:
      engine: stt.faster_whisper
      metadata:
        language: en
        format: wav
        codec: pcm
        bit_rate: 16
        sample_rate: 16000
        channel: 1
    timestamp: "2024-03-19T19:25:40.454451+00:00"
stt:
  engine: stt.faster_whisper
  metadata:
    language: en
    format: wav
    codec: pcm
    bit_rate: 16
    sample_rate: 16000
    channel: 1
  done: false

I thought it could just be slow to process, so I left it for a few minutes, but nothing.

My Assist pipeline is pretty basic, using faster-whisper with the tiny model; I've tried different models with no difference. Piper is standard and works fine when testing in Assist debug.

Also, to fully verify the Assist pipeline, I used a microphone on my PC in debug mode, which works perfectly fine:

init_options:
  start_stage: wake_word
  end_stage: tts
  input:
    sample_rate: 44100
  pipeline: 01h3aqwm2apt1dftgvbzyfb4sw
  conversation_id: null
stage: done
run:
  pipeline: 01h3aqwm2apt1dftgvbzyfb4sw
  language: en
  runner_data:
    stt_binary_handler_id: 4
    timeout: 300
events:
  - type: run-start
    data:
      pipeline: 01h3aqwm2apt1dftgvbzyfb4sw
      language: en
      runner_data:
        stt_binary_handler_id: 4
        timeout: 300
    timestamp: "2024-03-19T19:30:39.246644+00:00"
  - type: wake_word-start
    data:
      entity_id: wake_word.openwakeword
      metadata:
        format: wav
        codec: pcm
        bit_rate: 16
        sample_rate: 16000
        channel: 1
      timeout: 3
    timestamp: "2024-03-19T19:30:39.246888+00:00"
  - type: wake_word-end
    data:
      wake_word_output:
        wake_word_id: ok_nabu_v0.1
        wake_word_phrase: ok nabu
        timestamp: 1990
    timestamp: "2024-03-19T19:30:43.479709+00:00"
  - type: stt-start
    data:
      engine: stt.faster_whisper
      metadata:
        language: en
        format: wav
        codec: pcm
        bit_rate: 16
        sample_rate: 16000
        channel: 1
    timestamp: "2024-03-19T19:30:43.479975+00:00"
  - type: stt-vad-start
    data:
      timestamp: 2490
    timestamp: "2024-03-19T19:30:44.445430+00:00"
  - type: stt-vad-end
    data:
      timestamp: 3340
    timestamp: "2024-03-19T19:30:46.149604+00:00"
  - type: stt-end
    data:
      stt_output:
        text: " Turn on Bedroom Light."
    timestamp: "2024-03-19T19:30:46.694521+00:00"
  - type: intent-start
    data:
      engine: homeassistant
      language: en
      intent_input: " Turn on Bedroom Light."
      conversation_id: null
      device_id: null
    timestamp: "2024-03-19T19:30:46.694586+00:00"
  - type: intent-end
    data:
      intent_output:
        response:
          speech:
            plain:
              speech: Turned on the light
              extra_data: null
          card: {}
          language: en
          response_type: action_done
          data:
            targets: []
            success:
              - name: Bedroom Light
                type: entity
                id: light.bedroom_light
            failed: []
        conversation_id: null
    timestamp: "2024-03-19T19:30:46.719617+00:00"
  - type: tts-start
    data:
      engine: tts.piper
      language: en_GB
      voice: en_GB-vctk-medium
      tts_input: Turned on the light
    timestamp: "2024-03-19T19:30:46.719676+00:00"
  - type: tts-end
    data:
      tts_output:
        media_id: >-
          media-source://tts/tts.piper?message=Turned+on+the+light&language=en_GB&voice=en_GB-vctk-medium
        url: >-
          /api/tts_proxy/104c89b5f9053e4751d03002aab527c96124bd77_en-gb_d3b473ba1f_tts.piper.mp3
        mime_type: audio/mpeg
    timestamp: "2024-03-19T19:30:46.719913+00:00"
  - type: run-end
    data: null
    timestamp: "2024-03-19T19:30:46.719940+00:00"
wake_word:
  entity_id: wake_word.openwakeword
  metadata:
    format: wav
    codec: pcm
    bit_rate: 16
    sample_rate: 16000
    channel: 1
  timeout: 3
  done: true
  wake_word_output:
    wake_word_id: ok_nabu_v0.1
    wake_word_phrase: ok nabu
    timestamp: 1990
stt:
  engine: stt.faster_whisper
  metadata:
    language: en
    format: wav
    codec: pcm
    bit_rate: 16
    sample_rate: 16000
    channel: 1
  done: true
  stt_output:
    text: " Turn on Bedroom Light."
intent:
  engine: homeassistant
  language: en
  intent_input: " Turn on Bedroom Light."
  conversation_id: null
  device_id: null
  done: true
  intent_output:
    response:
      speech:
        plain:
          speech: Turned on the light
          extra_data: null
      card: {}
      language: en
      response_type: action_done
      data:
        targets: []
        success:
          - name: Bedroom Light
            type: entity
            id: light.bedroom_light
        failed: []
    conversation_id: null
tts:
  engine: tts.piper
  language: en_GB
  voice: en_GB-vctk-medium
  tts_input: Turned on the light
  done: true
  tts_output:
    media_id: >-
      media-source://tts/tts.piper?message=Turned+on+the+light&language=en_GB&voice=en_GB-vctk-medium
    url: >-
      /api/tts_proxy/104c89b5f9053e4751d03002aab527c96124bd77_en-gb_d3b473ba1f_tts.piper.mp3
    mime_type: audio/mpeg

So it must be something between the voice assistant on the S3 Box and the Assist pipeline?

Where should I look for more debugging, or has anyone come across this before?

Also, some other questions I've been trying to figure out that might help:

  • How does the pipeline know when to stop processing audio, i.e. when should the audio clip be trimmed?
  • What sits between the audio being streamed from the S3 Box and Whisper, i.e. what is Home Assistant doing?

Any help would be greatly appreciated as it’s been driving me nuts trying to figure out what it could be.

Some version details:
HA: 2023.3.1
Whisper: wyoming - 1.5.3, faster-whisper - 1.0.1 via rhasspy/wyoming-whisper:latest
ESP32 S3 Box 3: esphome.voice-assistant version 2.0 / ESPHome version 2024.2.2

Hi,
How are you running HA?
Have you had the S3 Box 3 working previously?
Are the S3 Box and HA on the same VLAN?
Cheers

HA runs in a Kubernetes cluster, with an nginx load balancer for routing.

The S3 Box has only ever worked to the point that it responds to the local wake word and can be controlled from HA (mute, backlight toggle, etc.); I've not tried it with any other firmware (only voice assistant).

Yes, they are both on the same flat VLAN.

HA requires all UDP ports to be open, as a random port is generated with each audio stream, so that is the first thing to check.
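Since the stream port is picked at random, one quick sanity check is whether UDP datagrams can reach a given port at all. The sketch below is a generic diagnostic, not anything specific to HA or ESPHome: it binds a listener on a port, fires one datagram at it, and reports whether it arrived (the port number in the usage line is an arbitrary example, not a port HA actually uses).

```python
import socket

def udp_reachable(host: str, port: int, payload: bytes = b"ping",
                  timeout: float = 2.0) -> bool:
    """Bind a UDP listener on host:port, send one datagram to it,
    and report whether the datagram arrived within the timeout."""
    recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        recv.bind((host, port))
        recv.settimeout(timeout)
        send.sendto(payload, (host, port))
        data, _addr = recv.recvfrom(1024)
        return data == payload
    except socket.timeout:
        return False
    finally:
        recv.close()
        send.close()

# Example (loopback only; 16055 is an arbitrary test port):
# udp_reachable("127.0.0.1", 16055)
```

To test across the actual network path, you would run the listener half on the HA host and the sender half on another machine on the VLAN, so a firewall or missing port mapping in between shows up as a timeout.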

Ah, interesting. From a quick search it seems the port range is not defined / is random.

I guess I'll need to try one of the workarounds.

Just to close this out, here's the workaround I ended up with:

I applied the changes in this PR to voice-assistant.py in the esphome module.

I added the modified file as a ConfigMap to my Home Assistant k8s deployment.

I updated the deployment to expose the set of UDP ports as NodePorts. This worked because the HA instance is exposed through an ingress, so the S3 Box tries to reach the host IP anyway.
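For anyone hitting the same thing, the NodePort exposure looked roughly like the sketch below. The port numbers, Service name, and labels here are all hypothetical; the real set of ports depends on whatever range the modified voice-assistant.py pins to. Note that a Service cannot express a port *range*, so each pinned UDP port needs its own entry, and NodePorts must fall in the cluster's allowed range (30000-32767 by default).

```yaml
# Sketch only: expose a pinned set of voice-assistant UDP ports as NodePorts.
# Names, labels, and port numbers are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: home-assistant-voice-udp
spec:
  type: NodePort
  selector:
    app: home-assistant        # must match the HA deployment's pod labels
  ports:
    - name: voice-udp-0
      protocol: UDP
      port: 30100              # cluster-internal port
      targetPort: 30100        # port the HA container listens on
      nodePort: 30100          # port exposed on the node itself
    - name: voice-udp-1
      protocol: UDP
      port: 30101
      targetPort: 30101
      nodePort: 30101
```

Keeping `port`, `targetPort`, and `nodePort` identical avoids any translation between what HA binds and what the S3 Box sends to on the host IP.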

Also, in the PR there's mention of ongoing work to change the ESPHome voice assistant to receive audio over the ESPHome API instead :pray: