Speeding up Piper on slow hardware

I’d rather not self-promote, as I’ve already shared a post in the relevant section.
But after using this method for a few days, I can confirm its usefulness, so I will duplicate the information in this section.

Now my J4205 can easily play back responses from LLMs.

In the future, I expect improvements to the Wyoming protocol, local servers, and cloud integrations for various TTS engines with streaming support. Then this method will no longer be as relevant.

But for now, my integration might already prove useful to someone out there.

  • Copy the streaming_tts_proxy folder from this repository to your custom_components folder.
  • Restart Home Assistant.
  • Add the Streaming TTS Proxy integration.
  • Specify the host and port (use core-piper:10200 for the add-on); a quick reachability check is sketched after this list.
  • Specify the language and voice in the settings.
  • Start testing.
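Before adding the integration, it can help to confirm that the Wyoming endpoint is actually reachable. A minimal sketch, assuming the default add-on host and port from the steps above:

```python
# Quick reachability check for the Piper add-on's Wyoming endpoint.
# Host and port follow the defaults mentioned above; adjust for your setup.
import socket

host, port = "core-piper", 10200

try:
    with socket.create_connection((host, port), timeout=5):
        print(f"{host}:{port} is reachable")
except OSError as err:
    print(f"cannot reach {host}:{port}: {err}")
```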

The only requirement is that your server must be able to generate 1 second of speech in no more than 1 second of wall-clock time (RTF ≤ 1), preferably slightly faster.
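To make that concrete, here is a trivial worked example; the timings are placeholders, so measure your own server:

```python
# Real-time factor: RTF = synthesis time / duration of the audio produced.
# Placeholder numbers; time your own server to get real values.
synthesis_time = 0.8   # seconds spent generating the clip
audio_duration = 1.0   # seconds of speech in the clip

rtf = synthesis_time / audio_duration
print(f"RTF = {rtf:.2f}")
if rtf > 1.0:
    print("too slow: playback will stall waiting for the next chunk")
```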

In the diagrams I tried to show how the different satellite versions are handled; there are some inaccuracies, but the idea should be clear.

Great job, this is something I was thinking about myself recently, as waiting for LLM responses via HA can feel slow.
Do you know if your solution would also apply to TTS coming into HA via OpenAI TTS? (GitHub - sfortis/openai_tts: Custom TTS component for Home Assistant. Utilizes the OpenAI speech engine or any compatible endpoint to deliver high-quality speech. Optionally offers chime and audio normalization features.)
I use this with Kokoro instead of Piper as it sounds much more realistic.

My variant is designed for the Wyoming protocol.
However, developers of custom voice integrations can already adapt their components to the new method.
Each author will also need to decide how to segment the text. Some TTS services likely support true streaming input, which could yield even better results.
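For illustration only, here is one naive way that segmentation could look. This is not the integration's actual logic, just a sketch of sentence-boundary chunking over an incremental text stream:

```python
# Naive sketch: flush a chunk whenever a sentence ends, so synthesis can
# start before the full LLM response has arrived.
import re

def iter_chunks(token_stream):
    """Yield sentence-sized chunks from an incremental text stream."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush on sentence-ending punctuation followed by whitespace.
        while True:
            match = re.search(r"[.!?]\s", buffer)
            if not match:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()

# Example: tokens arriving from an LLM
tokens = ["Hello", " there.", " The weather", " is fine today.", " Bye"]
for chunk in iter_chunks(tokens):
    print(repr(chunk))  # each chunk can be sent to the TTS server immediately
```

Splitting on sentence boundaries keeps the chunks natural-sounding; a real implementation also has to handle abbreviations, numbers, and languages with different punctuation.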


Thanks, I use GitHub - remsky/Kokoro-FastAPI: Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model w/CPU ONNX and NVIDIA GPU PyTorch support, handling, and auto-stitching to run Kokoro, which appears to support streaming… but I guess the OpenAI component sitting in between would need to as well. Anyway, I don’t want to derail your thread into a discussion about Kokoro. But great job on what you’ve done; I can imagine it being super useful for anyone running Piper :+1:

That’s quite nice. I configured it, and a command like "turn on/off the air conditioner" gets a response way faster. I hope this can be improved so it becomes official.

The integration does not affect the speed of command execution, only the way the voice response is generated. The benefit becomes apparent when interacting with LLMs.
It’s also worth noting that the development team set a 60-character threshold, after which the streaming synthesis mechanism kicks in.
This was done so that short standard responses can still be cached, which is only possible with the old synthesis method. Streaming transmits chunks directly to the satellite and does not create a cache.
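As a rough illustration of that behavior (the helper functions are made-up stand-ins, not Home Assistant's real API):

```python
# Hypothetical illustration of the 60-character threshold described above.
STREAMING_THRESHOLD = 60  # characters

def full_synthesis_with_cache(text: str) -> bytes:
    """Stand-in for the classic path: one cacheable clip per response."""
    return f"<audio for {text!r}>".encode()

def streaming_synthesis(text: str):
    """Stand-in for the streaming path: one chunk per sentence-ish piece."""
    for piece in text.split(". "):
        yield f"<chunk for {piece!r}>".encode()

def respond(text: str) -> None:
    if len(text) <= STREAMING_THRESHOLD:
        # Short standard answers stay on the old path so they can be cached.
        print("cached clip:", full_synthesis_with_cache(text))
    else:
        # Long (LLM) answers stream chunk by chunk; no cache is created.
        for chunk in streaming_synthesis(text):
            print("streamed:", chunk)

respond("Done.")                                   # below the threshold, cacheable
respond("Here is a much longer explanation " * 3)  # above it, streams
```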

What I mean is that it starts answering sooner after the command is done.
I know that because when I say "turn off the AC", the AC gives me a beep.
So it’s command, beep, answer, without any delay.

@mchk It seems your change is nearly in production, am I right? Piper 1.6.0 supports it, and Home Assistant 2025.7 will add the final bits. Is that right?

Everything is correct; Wyoming will receive a library update in the next release. Piper is already prepared for this update and can be connected directly. @synesthesiam implemented a very smart text processing solution that accounts for the nuances of many languages. Authors of other servers can use this implementation as a reference.
After the 2025.7 release, I’ll add handling for the supports_synthesize_streaming key reported by the server to my integration, to avoid duplicating data processing.
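For other integration authors, the branch might look something like this sketch; the shape of the info payload is my assumption, and only the supports_synthesize_streaming key comes from the post above:

```python
# Hedged sketch: pick a synthesis mode based on the server's capability flag.
def pick_synthesis_mode(server_info: dict) -> str:
    if server_info.get("supports_synthesize_streaming"):
        # Let the server handle text segmentation end to end.
        return "native-streaming"
    # Fall back to client-side chunking (as this proxy does today).
    return "proxy-chunking"

print(pick_synthesis_mode({"supports_synthesize_streaming": True}))  # native-streaming
print(pick_synthesis_mode({}))                                       # proxy-chunking
```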

Those are the kinds of side effects we love.


Another question: after Home Assistant, Wyoming, Piper, etc. implement support for streaming, what will be the use case for your solution? A proxy to allow streaming with other TTS solutions that still don’t support the feature?

I will use it because of the backup server functionality.


If I succeed, I will implement end-to-end interaction with a new type of server.

However, overall, the project will lose relevance.

1 Like