Streaming in TTS integrations

Hello everyone,
I am wondering if it is possible to stream the output audio in TTS integrations within a voice assistant pipeline. Waiting for the entire TTS process to complete takes a long time, especially if the output text is lengthy, but many TTS providers support streaming in their APIs. I have looked into some TTS integrations, and they seem to return the complete output audio file as their result. Is it possible to stream the output audio chunks in TTS integrations? Does the pipeline structure support this?
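To make the latency difference concrete, here is a minimal, self-contained sketch (not Home Assistant's actual API; the function names and delays are made up for illustration) comparing a TTS call that returns the whole audio at once with one that yields chunks as they are synthesized:

```python
import time
from typing import Iterator

def synthesize_full(text: str, per_char_delay: float = 0.001) -> bytes:
    """Simulated non-streaming TTS: nothing is playable until the
    whole result has been synthesized."""
    time.sleep(per_char_delay * len(text))  # stand-in for synthesis time
    return b"\x00" * len(text)              # stand-in for audio bytes

def synthesize_stream(text: str, chunk_chars: int = 50,
                      per_char_delay: float = 0.001) -> Iterator[bytes]:
    """Simulated streaming TTS: each audio chunk becomes available as
    soon as its slice of text is synthesized."""
    for i in range(0, len(text), chunk_chars):
        piece = text[i:i + chunk_chars]
        time.sleep(per_char_delay * len(piece))
        yield b"\x00" * len(piece)

text = "lorem " * 100  # a lengthy response, 600 characters

start = time.monotonic()
next(iter(synthesize_stream(text)))      # first playable chunk
ttfa_stream = time.monotonic() - start   # time to first audio, streaming

start = time.monotonic()
audio = synthesize_full(text)            # must wait for everything
ttfa_full = time.monotonic() - start     # time to first audio, buffered
```

With streaming, playback can start after the first chunk (~1/12 of the total synthesis time here), which is exactly why waiting on the full file hurts for long responses.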


This was added to the OpenAI Conversation integration in 2025.3.

That’s just for text, not for TTS


I am also interested in this. I use ElevenLabs TTS, and even with the Flash model, the responses are sometimes long enough to cause a timeout, which prevents any audio from being played at all. Taking advantage of ElevenLabs’ streaming TTS endpoint would allow for quicker responses and fewer (if any) timeouts.
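For reference, consuming a streaming TTS endpoint looks roughly like this. This is a hedged sketch, not Home Assistant code: the URL shape and the `xi-api-key` header are my reading of ElevenLabs’ public API docs, and `stream_url`/`stream_tts` are names I made up.

```python
import json
import urllib.request
from typing import Iterator

API_BASE = "https://api.elevenlabs.io"  # assumption: public ElevenLabs API base

def stream_url(voice_id: str) -> str:
    # Assumed shape of ElevenLabs' streaming text-to-speech endpoint.
    return f"{API_BASE}/v1/text-to-speech/{voice_id}/stream"

def stream_tts(text: str, voice_id: str, api_key: str,
               chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield audio chunks as the server produces them instead of
    waiting for the complete file. Needs a valid API key to run."""
    req = urllib.request.Request(
        stream_url(voice_id),
        data=json.dumps({"text": text}).encode(),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            yield chunk  # hand each chunk to the audio player immediately
```

Because chunks arrive over chunked transfer encoding, the player can start as soon as the first `read()` returns, which is what would sidestep the timeout on long responses.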

This and other PRs show that the feature is in development. It might be available as early as this year.


I’ve noticed this too when using ElevenLabs - have you been able to figure out where that timeout is set?


Not yet. I’m looking too.

Is this the one? TTS fails for longer responses (CancelledError + ESP_FAIL) · Issue #355 · esphome/home-assistant-voice-pe · GitHub

Yup that’s the one.

It seems that there are various hidden limits (timeouts at various stages, limits on the size of the STT/TTS payloads, etc.) that are not adjustable and not surfaced to the user, resulting in silent failures. I don’t even have anything in the logs; I might be looking in the wrong place, though.

The failure mode I’m seeing is that it just stops responding, but the audio gets successfully generated and I can play it in the Assist Pipeline Debug page.


And it’s irritating as heck because that means you burned the tokens and you don’t get to use ’em.

Yes, it seems they are adding this feature. Here are the PRs I have found related to it:
