TTS Streaming of LLM Chat Messages activated but not working

Hello swarm-intelligence, I am hoping to find an answer here to my problem.
I am trying to set up a Home Assistant voice pipeline on local hardware using:

My Voice Assistant Pipeline is configured as the following:

My problem is that TTS playback does not actually start early, even though streaming appears to be active.

What I see in the pipeline debug log:

  • intent-progress tokens are streamed correctly
  • tts_start_streaming: true is emitted during generation after roughly 60 characters streamed
  • but the actual tts-start event only happens after intent-end

Relevant timing from one example:

  • stt-end: 09:39:35.981457
  • tts_start_streaming: true: 09:39:38.447758
  • intent-end: 09:39:41.121806
  • tts-start: 09:39:41.122015

So:

  • streaming is requested about 2.67 seconds before intent end
  • but TTS still starts only after the full intent response is complete
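The gaps can be recomputed directly from the four timestamps quoted above. This is a minimal Python sketch; the dictionary keys simply reuse the event names from the log, and the helper `gap` is a hypothetical name for illustration, not part of Home Assistant:

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

# Timestamps copied from the pipeline debug log quoted above.
events = {
    "stt-end": "09:39:35.981457",
    "tts_start_streaming": "09:39:38.447758",
    "intent-end": "09:39:41.121806",
    "tts-start": "09:39:41.122015",
}
times = {name: datetime.strptime(ts, FMT) for name, ts in events.items()}


def gap(earlier: str, later: str) -> float:
    """Seconds elapsed between two logged events (hypothetical helper)."""
    return (times[later] - times[earlier]).total_seconds()


stream_to_intent_end = gap("tts_start_streaming", "intent-end")
intent_end_to_tts_start = gap("intent-end", "tts-start")

print(f"streaming requested -> intent-end: {stream_to_intent_end:.3f} s")
print(f"intent-end -> tts-start:           {intent_end_to_tts_start * 1000:.3f} ms")
```

The second gap is well under a millisecond, which is what makes tts-start look tied to intent-end rather than to the streaming request.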

The threshold seems to be reached as expected:
the text streamed before tts_start_streaming: true is

"Die Herstellung von Glas erfolgt in mehreren Schritten. Zunächst "

which is 65 characters including spaces.
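The character count can be verified with a short Python check. The 60-character value is the threshold mentioned in this thread; the name `STREAM_THRESHOLD` is made up for this sketch and is not a Home Assistant constant:

```python
import unicodedata

# Text streamed before tts_start_streaming: true, copied from the log.
streamed = "Die Herstellung von Glas erfolgt in mehreren Schritten. Zunächst "

# Normalize so that "ä" counts as one character regardless of encoding form.
streamed = unicodedata.normalize("NFC", streamed)

STREAM_THRESHOLD = 60  # assumption: threshold value as stated in this thread

char_count = len(streamed)
threshold_reached = char_count >= STREAM_THRESHOLD

print(char_count, threshold_reached)
```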

This makes it look like:

  1. token streaming works
  2. the streaming threshold is reached
  3. but the pipeline still buffers until intent-end before starting actual TTS playback

Is this expected behavior with external_conversation_stream, or does this indicate that early TTS playback is not fully wired through in this pipeline path?

I also found this GitHub discussion, which describes the same problem: TTS Streaming Delayed Until Conversation Agent Completes · home-assistant · Discussion #2877

Does anyone have an idea how to fix this?

You have not specified the TTS you are using.
In any case, look at the TTS component/server logs to understand what is happening.

Oh sorry, I am using Wyoming Piper (rhasspy/wyoming-piper, the Wyoming protocol server for the Piper text-to-speech system), hosted on my Beelink edge server.
Streaming should be enabled by default for Piper. The LLM is an Ollama model running on the same Beelink edge server, and it can also stream.

Looking at the raw logs of the voice assistant, streaming seems to be working: I get deltas containing single words from Ollama, and once about 65 characters have arrived, tts_start_streaming: true is set.

As I said, the problem seems to be that tts-start is not triggered until the Ollama stream has finished.

Streaming audio synthesis begins before the tts-start stage. However, the system has a 60-character startup threshold. If your conversation component is implemented correctly and the system receives text chunks from the generator, everything should work. Test with longer text. If it doesn’t work, the problem is with the component.

I have tried it with longer responses and with different devices (ESP32-S3 satellite, browser, iOS app), but the response still seems to start only when the whole stream is done. I read somewhere that this may have to do with sentences being too long, because only full sentences, not single words, are streamed to Piper, which made sense.
I therefore prompted the Ollama model to start the answer with a short sentence followed by the actual answer. Even with a shorter first sentence, the Piper logs show that Piper only starts synthesizing at the moment the last word has been generated and streamed to HAOS.

Maybe I am doing something else wrong but I can’t seem to find the answer to what I’m doing wrong…

Use the standard Ollama integration. Since it’s known to work correctly, you’ll be able to diagnose the system.

As for Wyoming Piper, streaming mode (slicing incoming text data into sentences) is enabled by default starting with version 2+.
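To illustrate what slicing incoming text into sentences means in practice, here is a rough Python sketch of a sentence buffer fed by LLM deltas. It is an assumption-laden illustration only, not the actual Wyoming Piper or Home Assistant code: the function name `sentences_from_deltas` and the simple punctuation rule are made up for this example.

```python
import re
from typing import Iterable, Iterator

# Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
# Real sentence segmentation is usually more careful than this.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_deltas(deltas: Iterable[str]) -> Iterator[str]:
    """Accumulate text deltas and yield each sentence once it is complete."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        parts = _SENTENCE_END.split(buffer)
        # All parts except the last one are complete sentences.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part is still unfinished; keep buffering it.
        buffer = parts[-1]
    # End of stream: flush whatever text is left.
    if buffer.strip():
        yield buffer.strip()


deltas = ["Glas ", "ist ", "hart. ", "Es ", "bricht."]
result = list(sentences_from_deltas(deltas))
print(result)
```

With a buffer like this, the first sentence can be handed to the synthesizer as soon as the delta after its closing punctuation arrives, well before the stream ends. If synthesis only ever starts at the end of the stream, that would suggest the text is reaching the TTS side in one piece rather than as chunks.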