Speech-to-text phase immediately ends

(This is about configuring ESPHome as a voice assistant. I’m not sure whether it belongs in the ESPHome or the Voice Assistant category.)

I’m trying to get the Voice PE working with a server-side wake word.

I have removed micro_wake_word from the config and I have set use_wake_word to true:

voice_assistant:
  id: va
  microphone: comm_mic
  media_player: nabu_media_player
  use_wake_word: true
  noise_suppression_level: 0
  auto_gain: 0 dbfs
  volume_multiplier: 1

I have assigned voice_assistant.start to the double click, and when I double-click, it successfully starts waiting for the wake word.

My issue now is that when I say the wake word, it doesn’t wait for a command. It ends the recording almost immediately:

  - type: wake_word-end
    data:
      wake_word_output:
        wake_word_id: americano
        wake_word_phrase: americano
        timestamp: 56570
    timestamp: "2025-01-06T06:00:05.785091+00:00"
  - type: stt-start
    data:
      engine: stt.rhasspy_speech
      metadata:
        language: en_US
        format: wav
        codec: pcm
        bit_rate: 16
        sample_rate: 16000
        channel: 1
    timestamp: "2025-01-06T06:00:05.785806+00:00"
  - type: stt-vad-start
    data:
      timestamp: 56710
    timestamp: "2025-01-06T06:00:05.885309+00:00"
  - type: stt-vad-end
    data:
      timestamp: 57610
    timestamp: "2025-01-06T06:00:06.784492+00:00"

As you can see, stt-vad-end happens less than a second (about 900 ms) after stt-vad-start.
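To put a number on that gap, here is a minimal Python sketch that diffs the two VAD events from the debug output above. It assumes the inner `data.timestamp` values are milliseconds into the audio stream (my reading of the log, not something I have confirmed), and cross-checks them against the outer wall-clock timestamps. The `data_timestamp` key and the `vad_gap` helper are names made up for this example.

```python
from datetime import datetime

# Values copied from the pipeline debug output above.
# "data_timestamp" is a made-up key holding the inner data.timestamp value,
# assumed (not confirmed) to be milliseconds into the audio stream.
events = [
    {
        "type": "stt-vad-start",
        "data_timestamp": 56710,
        "timestamp": "2025-01-06T06:00:05.885309+00:00",
    },
    {
        "type": "stt-vad-end",
        "data_timestamp": 57610,
        "timestamp": "2025-01-06T06:00:06.784492+00:00",
    },
]


def vad_gap(events):
    """Return (stream gap in assumed ms, wall-clock gap in seconds)."""
    by_type = {event["type"]: event for event in events}
    start = by_type["stt-vad-start"]
    end = by_type["stt-vad-end"]
    stream_gap = end["data_timestamp"] - start["data_timestamp"]
    wall_gap = datetime.fromisoformat(end["timestamp"]) - datetime.fromisoformat(
        start["timestamp"]
    )
    return stream_gap, wall_gap.total_seconds()


stream_gap, wall_gap_s = vad_gap(events)
print(f"stream gap: {stream_gap}, wall-clock gap: {wall_gap_s:.3f} s")
```

Both figures agree at roughly 0.9 s, which supports reading the inner timestamps as milliseconds.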

The first question I need to answer in order to troubleshoot this is: Where is the silence detection supposed to happen, in ESPHome, in Home Assistant or in the speech-to-text engine (wyoming-rhasspy-speech in my case)?

It turns out I simply didn’t start speaking fast enough after saying the wake word. Is there a way to make the VAD timeout longer? I feel quite rushed. :smile: