How to make local TTS faster?

Hello,

Got a NUC 14 Pro with an Intel Core Ultra 5 125H processor, 64 GB RAM and an NVMe drive. Installed HASSOS bare metal.
As a vision impaired user, I’d like to control all of my devices by voice, so I got the Voice Preview Edition as well.
With the default fully local installation it takes about 3-5 seconds to respond to basic commands.
What do you recommend to get it faster?
Best regards.

A GPU or AI accelerator such as Google Coral or Jetson Nano. A regular CPU is not the best way to run language processing. You might want to consider running it on a separate host altogether.

PS: it is probably speech-to-text (Whisper) slowing you down, not text-to-speech (Piper).

Coral works for AI and voice?
Details please

Use the ONNX ASR add-on with the Parakeet v2 model, paired with Piper for speech synthesis.

This setup delivers near-instantaneous responses for local voice commands — the 125H is plenty powerful.

Any extra delay usually comes from the LLM, so if you use one, it is important to choose a suitable model and provider.

“Response time to basic commands” has several components, starting with wake word recognition and ending with the action, which may be accompanied by a TTS response. They all take a small amount of time - I don’t think there’s a magic bullet to make the whole experience faster. You have to fine tune each one, and even then delays will vary - a long TTS response means a longer pause, for example.

As @Edwin_D says, you can speed up local speech to text by throwing a GPU at it, but that can get expensive and you’ll probably have to build the machine yourself.

I have a Willow Inference Server, which gives response times for simple commands comparable to Amazon Alexa. The Nvidia graphics card and a gaming case to put it in cost about £300 on eBay, and I had to learn some basic Linux to set it up. At that time it only worked with ESP32-S3-BOX voice assistants, so that was another expense.

I don’t regret the cost at all, and it was a really interesting project, but I still have to adopt strategies to cover delays. If I ask for a weather report - which I know will generate a pause because it’s quite a long piece of TTS - there is always a random comment first along the lines of “Hang on a minute…”

I should think you’d need a separate machine to run an LLM at the same time.

Thank you very much for your answers, and sorry for my mistake: yes, STT needs to be faster, not TTS.
I have several Raspberry Pi boards (16 GB) here; I’m not sure whether they would help, and I can buy an external device if you recommend something to get a fast response.
It is very exciting, but waiting about 5-10 seconds makes me sad.

No, I wouldn’t. Voice processing and LLMs are two completely different animals.

For your current setup, if you do what @mchk says you will have fast STT and TTS. The CPU there can handle the speech side (full stop, note: no LLM); this would be for basic commands, but not LLM-driven voice.

But fast speech is only the first part. Then you need a competent LLM. With a cloud LLM you will wait NO LESS than 5 seconds for a response if you don’t use local-first processing for your simple commands like turning on a light.

With your setup you’re probably better off starting with Speech-to-Phrase first and getting that working (it doesn’t require the LLM, so once you get it working, what it does handle should be near instant), but you won’t have the flexibility of LLM interpretation.
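To give a concrete idea of what that local-first path looks like, here is a minimal custom sentences file for the built-in local agent (a sketch based on the HA custom sentences YAML docs; the file name, phrasings and intents shown are just examples to adapt to your own setup):

```yaml
# config/custom_sentences/en/extra_commands.yaml
# Extra phrasings for the built-in HassTurnOn / HassTurnOff intents.
# {name} matches any entity you have exposed to Assist.
language: "en"
intents:
  HassTurnOn:
    data:
      - sentences:
          - "power up [the] {name}"
          - "switch on [the] {name}"
  HassTurnOff:
    data:
      - sentences:
          - "power down [the] {name}"
          - "switch off [the] {name}"
```

Commands matched this way never leave the box and never touch an LLM, so they come back about as fast as the STT stage allows.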

Because for a competent local LLM you need NO LESS THAN an Nvidia RTX 30-series or better GPU with at LEAST 8 GB of VRAM, preferably 16. If you can’t scratch that together, don’t even think about a local LLM; you will end in tears. Use a cloud LLM, but then accept more than a 5-second turnaround. (Read: there is no cheap accelerator like a Coral that helps with an LLM - not happening, this is GPU/NPU territory.)

Is there any external device or solution I can connect to my ASUS NUC 14 Pro to make this process faster? I can buy any hardware to connect to this machine, or I can buy a recommended mini PC with a GPU for my HASSOS install.
Thanks.
Thanks.

Your computer configuration is sufficient to provide a good STT and TTS experience. If you haven’t enabled the “Prefer handling commands locally” option in your assistant settings, please turn it on. If you prefer to delegate all operations to the LLM, a GPU is essential, but it won’t deliver a significant speed boost. With a 3060 12 GB and Qwen2.5, and only 20-30 entities exposed, typical operations still require 3-4 seconds of response time. If you have high demands for speed, GPU + LLM is not the optimal solution.

You can likely strap an eGPU solution on over Thunderbolt 4 if it’s available on your mini PC. The 14th-gen NUC should have TB4; mine does.

As said above, you can already get fast speech processing… TB4 + eGPU gets you the inference. This is how I gave Friday a 5070 Ti to chew on.

Thank you very much for your answers.
One last point: I have a 2023 Mac mini with an M2 chip and 8 GB RAM; will it be better for HASSOS than my NUC 14?
The Mac has an internal GPU as far as I know, but I’m not sure about a Mac installation with a USB Zigbee stick and other devices.

I think that might do much better, but I do not know if it will know how to use the resources available.

Whisper does not need to run on the same device as HA (it’s probably even better if it doesn’t), so I’d try to see if you can run only Whisper there and see if it helps. You can always decide later if you want HA as a whole on it.
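If you want to try that, a rough sketch of running Whisper as a standalone Wyoming service with Docker on the Mac (or any other box) could look like the compose file below; I’m quoting the image name, model and port from memory, so verify them against the current wyoming-whisper docs:

```yaml
# docker-compose.yml on the machine that will run STT (e.g. the Mac mini)
services:
  whisper:
    image: rhasspy/wyoming-whisper     # assumed image name - check the docs
    command: --model small-int8 --language en
    volumes:
      - ./whisper-data:/data           # caches the downloaded model
    ports:
      - "10300:10300"                  # Wyoming protocol port
    restart: unless-stopped
```

Then add it in HA via Settings > Devices & Services > Add Integration > Wyoming Protocol, point it at the Mac’s IP and port 10300, and pick it as the speech-to-text engine in your Assist pipeline.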

While the M2 Mac will be better for inference, it only has mild capabilities with 8 GB of RAM. It won’t be a happy LLM experience. Instead I’d use what you have and get comfortable with accelerated speech. When you’re ready, invest in a good GPU and eGPU it in…

I thought the LLM was a future wish, not the core of this question? A Piper/Whisper Assist pipeline is, as you said before, a different thing with different needs.

Guys, please allow me to thank you again; this community is a game changer in my life. I keep getting valuable comments and responses.
It is taking a lot of time to understand the HA basics, but thanks to you I now have many of the details.

  1. Currently, using ONNX ASR and Piper, it takes about 5 seconds to answer basic commands like “turn on XXX”. As I understand it, using HA Cloud or a local LLM will not change this result; am I right?
  2. If I connect ElevenLabs, Gemini or OpenAI, what happens? How do they process my commands?
Thanks.

1: It very much depends on your pipeline config, but yes, there’s a way to add the other options and keep what you have snappy. That’s the most I’ll say there until you get clearer on your plan.

2: It depends on whether you’re using a free or a paid API.

Hint: if you’re not paying… you’re the product. I will personally never, never ever ever push Friday’s prompts through a free API, for inference, speech or otherwise. It’s either paid with contractual guarantees of privacy, or local, or nothing for me.

This is quite a long time; an acceptable waiting time is less than 2 seconds before the command is executed and the response begins. That time is spent detecting the end of speech and on the STT stage.
Check your assistant’s debug menu to see how long the STT stage lasted.
