Hello swarm-intelligence, I am hoping to find an answer here to my problem.
I am trying to set up a Home Assistant voice pipeline on local hardware using:
- Raspberry Pi 5 running Home Assistant OS
- Beelink EQR 6 with Ryzen 7 6800U & 32GB RAM running local LLM, STT and TTS
- Satellite: Waveshare ESP32-S3 AI Smart Speaker Development Board (ESP32-S3-AUDIO-Board) with an onboard dual-microphone array, noise reduction, echo cancellation, and surround RGB lighting
My Voice Assistant pipeline is configured as follows:
- Conversation agent: my edited version of another HACS integration that provides an “external conversation agent”. All text input is sent to this interface and can be handled there. I use it to differentiate between Home Assistant's local handling and the LLM with streaming enabled. See: kraideblaich/tts_streaming_external_conv_agent on GitHub (a Home Assistant custom component that registers as a conversation agent and forwards requests to an external HTTP endpoint).
My problem is that TTS playback does not actually start early, even though streaming appears to be active.
What I see in the pipeline debug log:
- intent-progress tokens are streamed correctly
- tts_start_streaming: true is emitted during generation, after roughly 60 characters have been streamed
- but the actual tts-start event only happens after intent-end
Relevant timing from one example:
- stt-end: 09:39:35.981457
- tts_start_streaming: true: 09:39:38.447758
- intent-end: 09:39:41.121806
- tts-start: 09:39:41.122015
So:
- streaming is requested about 2.67 seconds before intent end
- but TTS still starts only after the full intent response is complete
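To double-check those gaps, here is a small Python sketch that recomputes them from the four timestamps in the debug log. The dictionary keys are just labels I picked for the events, and the helper `gap` is a hypothetical name, not anything from Home Assistant.

```python
from datetime import datetime

# Timestamps copied from the pipeline debug log above (same day, local time).
events = {
    "stt-end": "09:39:35.981457",
    "tts_start_streaming": "09:39:38.447758",
    "intent-end": "09:39:41.121806",
    "tts-start": "09:39:41.122015",
}

# Parse each timestamp; the date part defaults to 1900-01-01, which is fine
# because only differences between events matter here.
parsed = {name: datetime.strptime(value, "%H:%M:%S.%f") for name, value in events.items()}


def gap(earlier: str, later: str) -> float:
    """Return the time between two logged events in seconds."""
    return (parsed[later] - parsed[earlier]).total_seconds()


# Time between the streaming request and the end of the intent stage.
print(f"{gap('tts_start_streaming', 'intent-end'):.3f} s")  # 2.674 s

# Time between the end of the intent stage and the actual TTS start.
print(f"{gap('intent-end', 'tts-start') * 1000:.3f} ms")  # 0.209 ms
```

So tts-start follows intent-end by about 0.2 ms, while the streaming flag was already set roughly 2.67 s earlier.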
The threshold seems to be reached as expected:
the text streamed before tts_start_streaming: true is
"Die Herstellung von Glas erfolgt in mehreren Schritten. Zunächst "
which is 65 characters including spaces.
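The count can be reproduced with a few lines of Python. The threshold of 60 below is only my assumption based on the "roughly 60 characters" observation; I have not confirmed the exact value or whether the pipeline counts characters or bytes, so the sketch prints both.

```python
import unicodedata

# Text streamed before tts_start_streaming: true, normalized so that "ä"
# counts as one code point regardless of how it was copied.
streamed = unicodedata.normalize(
    "NFC", "Die Herstellung von Glas erfolgt in mehreren Schritten. Zunächst "
)

# Assumed threshold (hypothetical value, inferred from the observed behavior).
ASSUMED_THRESHOLD = 60

char_count = len(streamed)                  # Unicode code points
byte_count = len(streamed.encode("utf-8"))  # UTF-8 bytes ("ä" takes two)
reached = char_count >= ASSUMED_THRESHOLD

print(char_count)  # 65
print(byte_count)  # 66
print(reached)     # True
```

Either way of counting puts the streamed text above 60, which matches the point at which the flag is emitted.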
This makes it look like:
- token streaming works
- the streaming threshold is reached
- but the pipeline still buffers until intent-end before starting actual TTS playback
Is this expected behavior with external_conversation_stream, or does this indicate that early TTS playback is not fully wired through in this pipeline path?
I also found this discussion, which describes the same problem: TTS Streaming Delayed Until Conversation Agent Completes (home-assistant, Discussion #2877)
Do you guys have any idea how to fix this?