Streaming LLM responses into TTS for near-instant replies (works with HAVPE!)

New near-native solution: GitHub - eslavnov/ttmg_server: Talk To Me Goose Server

TL;DR: Read on to learn how to get TTS responses in under 3 seconds, even for huge texts.
I’ve been playing with my HAVPE devices and I love them, but I noticed that they don’t handle long TTS responses that well. For example, if you ask ChatGPT to tell you a story, you either hit a timeout or, if you manually increase the timeout, you can wait for dozens of seconds before getting a response. This happens because everything is sequential: you first wait for the full response from ChatGPT, then pass the whole response to the TTS engine and wait again while it generates a long audio file.

But we know that LLMs can stream their responses, and so can some TTS systems - so, hypothetically, we could stream the LLM’s response (before it’s even finished) into a TTS engine and save a bunch of time. I’ve written a small prototype that does exactly that, and it seems to work surprisingly well (on average it takes only ~3 seconds to start the audio stream).
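To make the idea concrete, here is a minimal sketch of the core loop (not the actual project code): it reads streamed tokens from OpenAI, buffers them until a sentence boundary, and hands each complete sentence to the TTS engine while the LLM keeps generating. The model names, voice, and sentence-splitting regex are illustrative assumptions.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthesize(sentence: str):
    """Send one sentence to OpenAI TTS; a real player would queue the audio."""
    audio = client.audio.speech.create(
        model="tts-1", voice="alloy", input=sentence  # illustrative choices
    )
    audio.write_to_file("chunk.mp3")  # placeholder sink for the audio chunk

def stream_llm_to_tts(prompt: str):
    """Stream an LLM response and synthesize it sentence by sentence."""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    buffer = ""
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # Naive sentence boundary: end punctuation followed by whitespace
        while (match := re.search(r"[.!?]\s", buffer)):
            sentence, buffer = buffer[:match.end()], buffer[match.end():]
            synthesize(sentence.strip())
    if buffer.strip():
        synthesize(buffer.strip())  # flush whatever is left at the end
```

The win comes from overlap: the first sentence starts playing while the LLM is still writing the rest, so perceived latency no longer grows with the length of the response.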

Right now it supports OpenAI as an LLM provider, and OpenAI and Google Cloud as TTS options. To make it work with Home Assistant (including voice devices), you need to run a Python script and create a couple of automations; all the details are available here: GitHub - eslavnov/llm-stream-tts: Stream LLMs responses directly into your TTS engine of choice

Basically, when your command starts with one of the defined trigger words, it switches to this streaming pipeline, which is perfect for stories, audiobooks, summaries, etc. A rough sketch of the routing check follows.
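The trigger words below are made-up examples, not the project’s actual defaults; the real list is whatever you configure:

```python
# Hypothetical trigger words; the real list is user-configured
TRIGGER_WORDS = ("tell", "read", "summarize")

def should_stream(command: str) -> bool:
    """Route to the streaming pipeline if the command starts with a trigger word."""
    words = command.strip().lower().split(maxsplit=1)
    return bool(words) and words[0] in TRIGGER_WORDS
```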

It’s still a very early work-in-progress, but I am curious to hear your thoughts!


Great idea! Would this potentially also work with local LLMs, like running Ollama with a model plus Piper and Whisper locally? (I have these running on a laptop, and HA running on an RPi4.)

Ollama supports streaming, so it should be possible. Whisper does not support streaming (I think?), but it would still benefit from splitting long responses into sentences. So it will probably be a bit slower than something like Google Cloud TTS (assuming everything else is equal), but still faster than the current situation.
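For reference, here is a minimal sketch of consuming Ollama’s streaming output; it assumes a default local install on port 11434, and the model name is illustrative:

```python
import json
import requests

# Assumes a default local Ollama install; model name is illustrative
OLLAMA_URL = "http://localhost:11434/api/generate"

def stream_ollama(prompt: str, model: str = "llama3"):
    """Yield response fragments as Ollama streams them."""
    with requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                data = json.loads(line)  # one JSON object per line
                yield data.get("response", "")
                if data.get("done"):
                    break
```

These fragments could then feed the same sentence-buffering loop shown above, with the TTS call swapped for a local engine.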

On a side note, I’ve just added support for ElevenLabs!

Whisper needs to gain streaming support.

I’ve updated my solution to provide near-native real-time streaming; see the new version here: GitHub - eslavnov/ttmg_server: Talk To Me Goose Server

It also works with Piper now!