Voice PE → Play Replies on an External Media Player

Hi all.

Been trying for some time to get the Voice PE to output its TTS to an external media player.

I use a Windows PC (old-school HTPC setup) in my living room with a good sound system and wanted the output there instead of the crappy built-in speaker.

Before this I had zero knowledge of ESPHome, but I have been running an HA instance for 4-5 years.

I took a lot of inspiration from this thread: Redirect Voice PE Replies to Sonos - Community Guides - Home Assistant Community.

ChatGPT helped me all the way here with the code.

First I installed HASS.Agent on my PC (called sofa) and got it working.
Then I installed ESPHome and imported the Voice PE.

Voice PE → Play Replies on an External Media Player (No Double Audio)

Goal

Build a Voice PE satellite that:

  • :microphone: Listens locally (microphone, wake word, LEDs all work normally)
  • :brain: Runs the full Assist pipeline in Home Assistant
  • :speaker_high_volume: Plays TTS replies on an external media player (e.g. media_player.sofa)
  • :zipper_mouth_face: Does not speak locally at the same time
  • :brick: Works reliably, without race conditions or ESPHome YAML errors

In this setup, media_player.sofa is a Windows PC running Hass.Agent, exposed to Home Assistant as a media player.

The Robust Solution (Recommended)

:brain: Key Idea

Let Voice PE keep generating TTS, but:

  • Mute the local Voice PE speaker
  • Capture the generated TTS URL
  • Hand off playback to Home Assistant
  • Let HA play the reply on any media player (here: a Windows PC via Hass.Agent)

This is done using:

  • an input_text helper as a bridge
  • a Home Assistant script for playback
  • a small ESPHome override

ESPHome (Voice PE override)

What this does

  • Ducks the local mixer
  • Mutes the local Voice PE speaker
  • Saves the TTS URL into Home Assistant
  • Triggers the HA playback script
  • Restores everything afterward

ESPHome YAML (override only)

substitutions:
  name: home-assistant-voice-095a6b
  friendly_name: Home Assistant Voice 095a6b

packages:
  Nabu Casa.Home Assistant Voice PE: github://esphome/home-assistant-voice-pe/home-assistant-voice.yaml

esphome:
  name: ${name}
  name_add_mac_suffix: false
  friendly_name: ${friendly_name}

api:
  encryption:
    key: ******************

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password

# ------------------------------------------------------------
# Voice PE reply redirect (NO double audio + no "blip" after idle)
#
# Home Assistant prerequisites:
# 1) Helper: input_text.voice_pe_tts_url
# 2) Script: script.voice_pe_play_reply_on_sofa
#    (reads input_text.voice_pe_tts_url and plays it on media_player.sofa)
# ------------------------------------------------------------

voice_assistant:
  # PRE-MUTE EARLY to prevent the "local starts speaking briefly then stops" issue after idle
  on_intent_progress:
    - if:
        condition:
          lambda: 'return !x.empty();'
        then:
          - logger.log:
              level: DEBUG
              format: "Redirect: pre-muting local Voice PE speaker before TTS begins."
          - media_player.volume_set:
              id: external_media_player
              volume: 0.0

  # Duck hard during TTS and ensure local speaker stays muted
  on_tts_start:
    - logger.log:
        level: INFO
        format: "Redirect: ducking local mixer + muting Voice PE speaker."
    - mixer_speaker.apply_ducking:
        id: media_mixing_input
        decibel_reduction: 51
        duration: 0s
    - media_player.volume_set:
        id: external_media_player
        volume: 0.0

  # Save the TTS proxy URL and trigger HA playback on sofa (Windows PC via Hass.Agent)
  on_tts_end:
    - logger.log:
        level: INFO
        format: "Redirect: saving TTS URL to HA helper + starting sofa playback script."
    - homeassistant.service:
        service: input_text.set_value
        data:
          entity_id: input_text.voice_pe_tts_url
          value: !lambda |-
            return x;

    - homeassistant.service:
        service: script.turn_on
        data:
          entity_id: script.voice_pe_play_reply_on_sofa

  # Restore state when pipeline is fully finished
  on_end:
    - wait_until:
        not:
          voice_assistant.is_running:
    - mixer_speaker.apply_ducking:
        id: media_mixing_input
        decibel_reduction: 0
        duration: 0s
    - media_player.volume_set:
        id: external_media_player
        volume: 1.0

I added this to scripts.yaml:

voice_pe_play_reply_on_sofa:
  alias: Voice PE – Play reply on sofa
  mode: restart
  sequence:
  - variables:
      url: '{{ states(''input_text.voice_pe_tts_url'') }}'
  - condition: template
    value_template: '{{ url.startswith(''http'') }}'
  - target:
      entity_id: media_player.sofa
    data:
      media_content_id: '{{ url }}'
      media_content_type: music
    action: media_player.play_media

Added this to configuration.yaml:

input_text:
  voice_pe_tts_url:
    name: Voice PE last TTS URL
    max: 255
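
The helper above caps its value at 255 characters, and the handoff needs two separate calls from the device. A possible variant, not part of the original setup, is to pass the URL straight to the script as a variable. The sketch below assumes a script field named url (a name chosen here for illustration) and that the ESPHome device is allowed to perform Home Assistant actions. It is untested against the Voice PE firmware.

```yaml
# --- ESPHome (Voice PE override): used in place of the two calls in on_tts_end ---
voice_assistant:
  on_tts_end:
    - homeassistant.service:
        # Calling script.<name> directly passes "data" to the script as variables
        service: script.voice_pe_play_reply_on_sofa
        data:
          url: !lambda 'return x;'   # "url" is a hypothetical field name

# --- Home Assistant scripts file: same script, reading the variable instead of the helper ---
voice_pe_play_reply_on_sofa:
  alias: Voice PE – Play reply on sofa
  mode: restart
  fields:
    url:
      description: TTS URL handed over by the Voice PE
  sequence:
    # Skip playback if no usable URL was passed in
    - condition: template
      value_template: "{{ url is defined and url.startswith('http') }}"
    - action: media_player.play_media
      target:
        entity_id: media_player.sofa
      data:
        media_content_id: "{{ url }}"
        media_content_type: music
```

With this variant the input_text helper is no longer needed, though keeping it can still be handy for checking the last URL while debugging.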

Why This Works (and Why Others Fail)

  • :check_mark: No unsupported ESPHome YAML
  • :check_mark: No direct media_player hijacking
  • :check_mark: No timing race conditions
  • :check_mark: Works with any HA media player:
    • Sonos
    • Music Assistant
    • Chromecast
    • Windows PC via Hass.Agent
  • :check_mark: Voice PE remains fully functional as a satellite

Voice PE still believes it is playing locally, but it's muted, while Home Assistant takes over actual playback.


Result

You end up with a clean, professional Voice Assistant setup:

  • One device listens
  • Another device speaks
  • No echo
  • No hacks
  • No flakiness

This is effectively how commercial multi-room assistants work, just implemented with full local control.


If you want, this approach can easily be extended to:

  • restore exact previous volume
  • room-aware replies
  • multi-room announcements
  • Music Assistant ducking
  • LED sync with external playback

But as-is, this is already a production-grade solution.
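
As a rough idea of the first extension (restoring the exact previous volume), the override could remember the speaker volume before muting instead of always restoring to 1.0. This is a minimal sketch under assumptions: the global and script names (saved_reply_volume, reply_muted, mute_for_reply, restore_after_reply) are made up for illustration, and it relies on the media player object exposing its current level as a volume member, which has not been verified on the Voice PE.

```yaml
globals:
  - id: saved_reply_volume      # hypothetical: volume level before muting
    type: float
    restore_value: no
    initial_value: '1.0'
  - id: reply_muted             # hypothetical: guards against saving 0.0 twice
    type: bool
    restore_value: no
    initial_value: 'false'

script:
  - id: mute_for_reply
    then:
      - if:
          condition:
            lambda: 'return !id(reply_muted);'
          then:
            # Remember the current level only on the first mute of this reply
            - lambda: |-
                id(saved_reply_volume) = id(external_media_player).volume;
                id(reply_muted) = true;
            - media_player.volume_set:
                id: external_media_player
                volume: 0.0
  - id: restore_after_reply
    then:
      - if:
          condition:
            lambda: 'return id(reply_muted);'
          then:
            # Put back whatever level was active before the reply
            - media_player.volume_set:
                id: external_media_player
                volume: !lambda 'return id(saved_reply_volume);'
            - lambda: 'id(reply_muted) = false;'
```

To use it, call script.execute: mute_for_reply where the override sets the volume to 0.0 (on_intent_progress and on_tts_start), and script.execute: restore_after_reply in on_end in place of the volume: 1.0 step. The guard flag matters because both triggers mute, and the second one would otherwise save 0.0 as the "previous" volume.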

Would this work with the "$13 Voice Assistant" aka M5 Atom Echo? Thanks for the great work.

Thank you so much! This is awesome :heart: I skipped the HASS.Agent part. Just needed my Home Assistant Voice PE to stop stuttering when replying (because it made me crazy). And now it TTS replies on my Google Nest Mini, or whatever media_player.* i want, with perfect flow and sound. Yess, I love it, VERY NAJS!!! :hugs:

[Help] Voice PE + Sonos hybrid setup: TTS plays on PE instead of Sonos, volume restore issues

Setup goal:
ESPHome Voice PE (Home Assistant Voice PE, ESP32-S3) should handle only microphone/wake word detection. TTS responses should play on a Sonos
speaker (Arbeitszimmer) instead of the built-in PE speaker. During listening/processing, Sonos should be ducked to ~2% volume, then restored after
the response.


Hardware/Software:

  • Home Assistant 2026.6.3
  • ESPHome Voice PE (home-assistant-voice-0943c6)
  • Sonos (media_player.arbeitszimmer)
  • Pipeline: Nabu Casa Cloud STT + TTS (ElkeNeural, de-DE), Gemini conversation agent

ESPHome YAML (current):

voice_assistant:
  on_tts_start:
    - media_player.stop:
        id: external_media_player
  on_tts_end:
    - homeassistant.service:
        service: input_text.set_value
        data:
          entity_id: input_text.voice_pe_tts_url
          value: !lambda 'return x;'
    - homeassistant.service:
        service: script.turn_on
        data:
          entity_id: script.voice_pe_play_reply_on_sonos
HA Script (voice_pe_play_reply_on_sonos):
sequence:
  - variables:
      url: "{{ states('input_text.voice_pe_tts_url') }}"
      saved_vol: "{{ states('input_number.voice_pe_sonos_pre_duck_volume') | float(0.14) }}"
  - condition: template
    value_template: "{{ url.startswith('http') }}"
  - action: media_player.play_media
    target:
      entity_id: media_player.arbeitszimmer
    data:
      media_content_id: "{{ url }}"
      media_content_type: music
  - action: media_player.volume_set
    target:
      entity_id: media_player.arbeitszimmer
    data:
      volume_level: 0.3
  - delay:
      seconds: 1
  - wait_for_trigger:
      - trigger: state
        entity_id: media_player.arbeitszimmer
        to: idle
      - trigger: state
        entity_id: media_player.arbeitszimmer
        to: paused
    timeout:
      seconds: 30
    continue_on_timeout: true
  - action: sonos.restore
    data:
      entity_id: media_player.arbeitszimmer
      with_group: false
  - delay:
      milliseconds: 300
  - action: media_player.volume_set
    target:
      entity_id: media_player.arbeitszimmer
    data:
      volume_level: "{{ saved_vol }}"

HA Automation (ducking):
alias: CLAUDE_Voice PE Sonos Ducking
triggers:
  - trigger: state
    entity_id: assist_satellite.home_assistant_voice_0943c6_assist_satellit
    to: listening
mode: single
max_exceeded: silent
actions:
  - action: input_number.set_value
    target:
      entity_id: input_number.voice_pe_sonos_pre_duck_volume
    data:
      value: "{{ state_attr('media_player.arbeitszimmer', 'volume_level') | float(0.14) }}"
  - action: media_player.volume_set
    target:
      entity_id: media_player.arbeitszimmer
    data:
      volume_level: 0.02
  - action: sonos.snapshot
    data:
      entity_id: media_player.arbeitszimmer
      with_group: false


Problems encountered and attempted solutions:

  1. TTS URL is FLAC, and Sonos announce: true refuses to play it
    • on_tts_end provides a URL like http://ha-ip/api/tts_proxy/xxxx.flac
    • Sonos logs: AudioClip announce only supports MP3 and WAV; .flac will be attempted as a clip anyway, and then silently fails
    • Tried: renaming .flac → .mp3 in the URL → 404, the proxy token is format-bound
    • Tried: media_player.play_media with announce: true and media_content_type: music → Sonos goes to playing state but plays nothing audible
    • Tried: tts.cloud_say with the response text → but on_intent_progress does NOT provide the response text; it provides the TTS URL, same as
      on_tts_end
    • Working solution: media_player.play_media with media_content_type: music without announce: true; Sonos plays FLAC fine this way
  2. PE still speaks locally even when script runs
    • on_tts_start → media_player.volume_set id: external_media_player volume: 0.0 does NOT stop local playback; TTS runs through a different internal
      audio path
    • mixer_speaker.apply_ducking decibel_reduction: 51 also doesn't fully mute it
    • Working solution: media_player.stop id: external_media_player in on_tts_start stops local TTS
  3. Volume not restored after TTS
    • sonos.snapshot / sonos.restore does NOT reliably restore volume; it only restores the stream
    • Root cause: snapshot is taken after ducking is already applied (volume=0.02), so restore brings back 0.02
    • Tried: snapshot before ducking → Sonos briefly "flickers" (resumes stream at full volume for ~0.5s) during snapshot
    • Tried: manual volume_set after restore using input_number → works, but conflicts with voice volume intents
    • Current approach: save volume to input_number first → set volume to 0.02 → snapshot → TTS plays → sonos.restore (restores stream) → manual
      volume_set from input_number
    • Remaining issue: the 0.5s volume flicker during snapshot has not been fully eliminated
  4. Music doesn't resume after TTS / resumes after 30s delay
    • wait_for_trigger: idle/paused fires immediately (before TTS actually plays) because Sonos briefly hits idle during play_media initialization
    • Fix: added delay: 1s after play_media before starting the wait
    • Alternatively: wait_for_trigger: playing first, then idle, but if TTS is very short, the second wait misses the idle transition and falls
      through to the 30s timeout
    • Current approach: delay: 1s + single wait_for_trigger on idle/paused with 30s timeout; this mostly works but occasionally still hits the full timeout
  5. Voice volume intents ("make it quieter") are overridden by volume restore
    • When user says "OK Nabu, turn volume down", HA executes the intent during responding state
    • But after TTS, the script restores volume from input_number (saved at listening time) → new volume is lost
    • Attempted: second automation watching for volume changes during responding to update input_number → backfired because the script's own
      volume_set: 0.3 (TTS volume) was saved instead
    • Not yet solved
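
For the missed idle transition described in the fourth problem, one pattern that might help is waiting on the current state with wait_template instead of on a state change with wait_for_trigger. A wait_template is evaluated as soon as the step is reached, so if the reply has already finished, the step passes straight away instead of waiting for a transition that already happened. The fragment below is an untested sketch for the script's sequence: the 1s delay is kept from the current approach, and the 5s timeout for the first wait is an assumed value.

```yaml
# Fragment of the script sequence, in place of "delay: 1s" + wait_for_trigger
- delay:
    seconds: 1            # skip the brief idle during play_media initialization
# Step 1: give the reply a few seconds to actually start playing
- wait_template: "{{ is_state('media_player.arbeitszimmer', 'playing') }}"
  timeout:
    seconds: 5            # assumed value, tune as needed
  continue_on_timeout: true
# Step 2: continue as soon as the player is no longer playing.
# If the reply already ended, this is true immediately.
- wait_template: "{{ not is_state('media_player.arbeitszimmer', 'playing') }}"
  timeout:
    seconds: 30
  continue_on_timeout: true
```

With a very short reply that ends during the initial delay, the first wait would run into its timeout and the second would pass at once, so the worst case is roughly 6 seconds before sonos.restore rather than the full 30.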

Questions for the community:

  1. Is there a cleaner way to get sonos.snapshot to not cause a brief audio flicker?
  2. Is there any way to intercept the TTS response text (not the URL) from the pipeline in HA, so we can use tts.speak directly on Sonos instead of
    routing through ESPHome?
  3. Has anyone found a reliable wait_for_trigger pattern that works for both short and long TTS responses on Sonos?
  4. Any ideas for handling voice volume intents while also doing volume management in the script?