Best hardware to run voice locally?

I’m currently running Home Assistant on a Blue I bought a few years ago, and I’m using Alexa for voice. I want to switch to Home Assistant Voice and run everything locally. This will be a dedicated system with no Docker or VM usage. What is a good prebuilt system that gives good voice performance without being overkill?

Hello morikaweb,

If you are going to run TTS, STT, and an LLM locally, there is no such thing as overkill…
You will need all the GPU and CPU horsepower you can afford.


Thanks, that’s kinda what I thought. I suppose my real question is which companies are good?

I am looking at something like an HP EliteDesk, a Beelink, or a Minisforum system, but I’m not sure which company is more reliable. Also, what is better for this job, Intel or AMD? I’m very new to this voice/LLM stuff, so any advice is appreciated.

Hi @morikaweb, it really depends on what you want. If you want to run serious LLMs locally, you’ll need a powerful GPU like an RTX 5090 or a workstation GPU. However, if you prefer to rely on a commercial LLM like ChatGPT and only run the voice pipeline (STT/TTS) locally, there are two options:

  1. Run the voice pipeline on an iGPU – go with the Intel N100. Most people would agree that the N100 is the most cost-effective option for running Whisper and Piper locally, and 16GB of RAM is enough even for the larger Whisper/Piper models.
  2. Run the voice pipeline on an NVIDIA GPU – this gives the best performance, with blazing-fast STT and TTS. It also depends on the model size, but 8GB of VRAM will be more than enough for most models. However, you’ll need to build your own PC, which can be costly and power-hungry. (A small benchmark sketch follows below the list.)
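
If you want to see where your own hardware lands before committing, here is a minimal sketch using the faster-whisper Python package. The package and model size are real, but the test clip name and the timing loop are just placeholders for illustration, not anything Home Assistant itself runs:

```python
# Minimal sketch: time the same Whisper transcription on CPU (what an N100
# would do) and on an NVIDIA GPU, assuming faster-whisper is installed and
# "clip.wav" (a placeholder) is a short recorded command.
import time

from faster_whisper import WhisperModel


def time_transcription(device: str, compute_type: str, audio: str = "clip.wav") -> float:
    """Return wall-clock seconds to transcribe one clip on the given device."""
    model = WhisperModel("small", device=device, compute_type=compute_type)
    start = time.perf_counter()
    segments, _info = model.transcribe(audio)
    # transcribe() returns a lazy generator; join it to force the actual work.
    print(" ".join(segment.text for segment in segments)[:80])
    return time.perf_counter() - start


cpu_seconds = time_transcription("cpu", "int8")       # N100-class box
gpu_seconds = time_transcription("cuda", "float16")   # only if a CUDA GPU is present
print(f"CPU: {cpu_seconds:.1f}s  GPU: {gpu_seconds:.1f}s")
```

If the CPU number is already well under a second for a typical command, the N100 route is probably enough for STT; the GPU mainly starts to matter once an LLM enters the picture.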

Thanks again; based on this and my research I have decided to start with a Beelink EQ14. If I switch to a higher-spec system down the road, the Beelink will be a great work PC.

I also know the LAN sucks, but I can fix that with a dongle. So I just wanted to confirm that these specs should be good?

Beelink EQ14

Would the Beelink Mini S12 Pro (Intel 12th-gen N100) be suitable for this?

What sucks about the LAN?

No.

Before any of you go any further, have a look at this:

There is no prebuilt commercial option. As @sayanova says, to get a fully local voice pipeline you need a high-end graphics card, and you’ll probably have to buy a gaming case to put it in. The results are great (comparable to Alexa), but it’s a lot of work and very expensive.

I’d have to agree you need a good GPU for an Ollama setup on Home Assistant. I’ve set up Ollama on a Mac mini M4, and the problem you are going to run into is that Home Assistant sends about 8k of context every time: all the devices, the instruction prompt, tool calling, etc. On a Mac mini M4 that translates to a 10-16 second delay to turn off one bloody light, because the Mac mini does not have enough memory bandwidth to get to the first token quickly. If I was just talking to it normally, I’d get a quick reply at 20 tokens a second on a 7-8B model.

For context, a Mac mini M4 only has about 100-200GB/s of memory bandwidth. A 3090 or above has at least 900GB/s, which is what you’ll need if you want a quick response.

Personally I just stuff Home Assistant in a VM on my server; a cheap Nvidia V100 with ECC memory is like 300 bucks on eBay, with about the same memory bandwidth as a 3090.
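
To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch. It assumes generation is memory-bandwidth bound and uses approximate bandwidth and model-size figures, so treat it as an estimate rather than a benchmark:

```python
# Back-of-the-envelope sketch, assuming token generation is memory-bandwidth
# bound: every new token streams the full weight file through the chip.
# Bandwidth and model-size figures are rough assumptions, not measurements.

def decode_ceiling_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/second if each token reads all weights once."""
    return bandwidth_gb_s / model_gb

MODEL_GB = 4.5  # a 7-8B model at roughly 4-bit quantization, weights only

for name, bandwidth in [("M4 Mac mini (~120 GB/s)", 120.0),
                        ("RTX 3090 (~936 GB/s)", 936.0)]:
    ceiling = decode_ceiling_tokens_per_sec(bandwidth, MODEL_GB)
    print(f"{name}: ~{ceiling:.0f} tokens/s ceiling")

# Prefill (the ~8k tokens Assist sends with every request) is an extra cost on
# top of this, which is where the multi-second wait before the first token
# comes from.
```

That ceiling works out to roughly 25 tokens/s on the M4, which is consistent with the ~20 tokens/s I see in practice.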

This is a very helpful comment!

I have been trying to figure out why Ollama via the web UI and SSH responds in real time on my old Intel MacBook (basic LLM models), but takes 2-3 minutes to respond via Home Assistant Assist.
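
One way to see where the time goes is to ask Ollama itself for the timing breakdown it reports on every request, first with a hand-typed prompt, then compared against what happens when Assist calls it. A rough sketch, assuming a local Ollama on the default port; the model tag and prompt below are placeholders, so swap in whatever you actually run:

```python
# Sketch: query a local Ollama instance and print the timing breakdown it
# returns. Endpoint and response fields are from Ollama's REST API; the model
# tag and prompt below are placeholders.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",                      # placeholder model tag
        "prompt": "Turn off the kitchen light.",  # placeholder prompt
        "stream": False,
    },
    timeout=300,
).json()

# Durations are reported in nanoseconds.
print("prompt tokens:", resp.get("prompt_eval_count"))
print("prompt eval  :", resp.get("prompt_eval_duration", 0) / 1e9, "s")
print("output tokens:", resp.get("eval_count"))
print("output eval  :", resp.get("eval_duration", 0) / 1e9, "s")
```

If a hand-typed prompt is fast but Assist is slow, the difference should show up as a much larger prompt token count and prompt eval time when Home Assistant builds the request.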

I thought the solution might be a 16GB unified-memory M4 Mac mini.

I’m running granite 3/4 and llama2…

Is it really 15-20 seconds on M4? Which models is this based on? Thanks!

It’s ALL ABOUT VRAM…

If your model doesn’t have enough VRAM, perf WILL suffer. No, it’ll flat-out suck.

The comment above about 8k context: that’s the MINIMUM. It goes up as you add capabilities and tools.

And at 8k of context you need pretty solidly at LEAST 12GB of VRAM. I need 12k just for tools and home context… and VRAM requirements climb quickly from there. I need 16GB to load gpt-oss-20b with enough context to be useful.

Quantized models get you farther in low-VRAM situations, but at some point the model becomes too stupid to do anything.

So not just RAM… dedicated VRAM (or dedicated RAM from the shared pool on machines using a flat memory model), and as much as you can reasonably afford.

The difference is striking… all in VRAM, sub-second response. Send in too much initial context, start spilling out of GPU VRAM, and it becomes minutes because the card can’t keep up with the memory demands.
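
To put rough numbers on the weights-plus-context point, here is a hedged back-of-the-envelope calculator. The layer, head, and dimension figures are illustrative for a generic 7-8B model with grouped-query attention, not the exact numbers for any specific checkpoint, and real runtimes add overhead on top:

```python
# Hedged estimate: VRAM ~= quantized weights + KV cache for the context window.
# Architecture numbers below are illustrative, not tied to a specific model.

def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billions * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache in GB: keys + values for every layer, KV head, and token (fp16)."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9


weights = weights_gb(params_billions=8, bits_per_weight=4)       # ~4 GB at Q4
cache = kv_cache_gb(layers=32, kv_heads=8, head_dim=128,
                    context_tokens=12_000)                       # ~12k-token home context
print(f"~{weights:.1f} GB weights + ~{cache:.1f} GB KV cache "
      f"= ~{weights + cache:.1f} GB before runtime overhead")
```

The cache term grows linearly with context, so doubling the exposed entities and tools roughly doubles that part of the bill, and a bigger or less-quantized model raises the weights term at the same time.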