Best hardware to run voice locally?

I’m currently running Home Assistant on a Blue I bought a few years ago, and I’m using Alexa for Voice. I want to switch to Home Assistant Voice and run everything local. This will be a dedicated system with no Docker or VM usage. What is a good prebuilt system to get good performance with voice without being overkill?

Hello morikaweb,

If you are going to run TTS, STT, and an LLM locally, there is no such thing as overkill…
You will need all the GPU and CPU help you can afford.


Thanks, that’s kinda what I thought. I suppose my real question is what companies are good?

I am looking at something like an HP EliteDesk, or a Beelink or Minisforum system, but I’m not sure which company is more reliable. Also, which is better for this job, Intel or AMD? I’m very new to this voice/LLM stuff, so any advice is appreciated.

Hi @morikaweb, it really depends on what you want. If you want to run serious LLMs locally, you’ll need a powerful GPU like the 5090 or a workstation GPU. However, if you prefer to rely on commercial LLMs like ChatGPT, there are two options:

  1. Run the voice pipeline on an iGPU – Go with the Intel N100. Most people would agree that the N100 is the most cost-effective option for running Whisper and Piper locally. 16GB of RAM will be enough for large models in Whisper/Piper.
  2. Run the voice pipeline on an NVIDIA GPU – this provides the best performance, with blazing-fast STT and TTS. It depends on the model size, but 8GB of VRAM will be more than enough for most models. However, you’ll need to build your own PC, which can be costly and power-hungry.
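As a rough sanity check on the RAM numbers above, here’s a small Python sketch estimating the weight memory of each Whisper model size at fp16. The parameter counts are the published figures for OpenAI’s Whisper models; the runtime overhead on top of the raw weights is not modeled.

```python
# Approximate published parameter counts for OpenAI Whisper models.
WHISPER_PARAMS = {
    "tiny": 39e6,
    "base": 74e6,
    "small": 244e6,
    "medium": 769e6,
    "large": 1550e6,
}

def model_ram_gb(params: float, bytes_per_param: int = 2) -> float:
    """Raw weight memory at fp16 (2 bytes/param); real usage adds overhead."""
    return params * bytes_per_param / 1e9

for name, params in WHISPER_PARAMS.items():
    print(f"{name:7s} ~{model_ram_gb(params):.1f} GB")
```

Even “large” is only ~3.1 GB of weights, so 16GB of system RAM leaves plenty of headroom for the OS, Home Assistant, and Piper alongside it.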

Thanks again. Based on this and my research, I have decided to start with a Beelink EQ14. If I switch to a higher-spec system down the road, the Beelink will make a great work PC.

I also know the LAN sucks, but I can fix that with a dongle. So I just wanted to confirm that these specs should be good?

Beelink EQ14

Would the Beelink Mini PC, Mini S12 Pro Intel 12th N100 be suitable for this?

What sucks about the LAN?

No.

Before any of you go any further, have a look at this:

There is no prebuilt option. As @sayanova says, to get local TTS you need a high-end graphics card, and you’ll probably have to buy a gaming case to put it in. The results are great - comparable to Alexa - but it’s a lot of work and very expensive.

I’d have to agree you need a good GPU for an Ollama setup on Home Assistant. I’ve set up Ollama on a Mac mini M4, and the problem you are going to run into is that Home Assistant sends about 8k of context every time, with all the devices, instruction prompt, tool calling, etc. On a Mac mini M4 that translates to a 10–16 second delay to turn off one bloody light, because the Mac mini does not have enough memory bandwidth to get to the first token quickly. If I was just talking to it normally, I’d get a quick reply at 20 tokens a second on a 7–8B model.

For context, a Mac mini M4 only has about 100–200GB/s of memory bandwidth. A 3090 or above has at least 900GB/s, which is what you’ll need if you want a quick response.

Personally I just stuff Home Assistant in a VM on my server; a cheap Nvidia V100 with ECC memory is like 300 bucks on eBay, with about the same memory bandwidth as a 3090.
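The tokens-per-second figure above lines up with a simple back-of-the-envelope calculation: LLM decoding is memory-bandwidth-bound, so generation speed is roughly bandwidth divided by the bytes the weights occupy. A hedged sketch (the ~4 GB figure for a 7B model at 4-bit quantization is an assumption for illustration):

```python
def decode_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound: each generated token streams all weights once."""
    return bandwidth_gb_s / model_gb

# A 7B model at 4-bit quantization is roughly 4 GB of weights (assumed).
mac_mini = decode_tokens_per_sec(100, 4.0)   # close to the ~20 tok/s reported
rtx_3090 = decode_tokens_per_sec(900, 4.0)
print(f"Mac mini M4: ~{mac_mini:.0f} tok/s, 3090: ~{rtx_3090:.0f} tok/s")
```

Time to first token is a separate problem: the ~8k-token prompt has to be prefilled first, which is compute-heavy, so a fast GPU helps there too.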

This is a very helpful comment!

I have been trying to figure out why the Ollama web UI and SSH give real-time responses on my old Intel MacBook (basic LLM models), but it takes 2–3 minutes to respond via Home Assistant Assist.

I thought the solution might be a 16GB unified memory M4 Mac mini.

I’m running granite 3/4 and llama2…

Is it really 15-20 seconds on M4? Which models is this based on? Thanks!

It’s ALL ABOUT VRAM…

If your model doesn’t have enough, perf WILL suffer. No, it’ll flat-out suck.

The comment above about 8k context: that’s the MINIMUM. It goes up as you add capabilities and tools.

And at 8k you pretty solidly need at LEAST 12G of VRAM. I need 12k of context just for tools and home context… And VRAM requirements climb fast with context. I need 16G to load gpt-oss-20b with enough context to be useful.

Quantized models get you farther in low-VRAM situations, but at some point the model becomes too stupid to do anything.

So not just RAM… dedicated VRAM (or dedicated RAM from the shared pool on machines using a flat memory model), and as much as you can reasonably afford.

The difference is striking… All in VRAM: sub-second response. Send in too much initial context and start spilling out of GPU VRAM, and it becomes minutes because the card can’t keep up with the memory demands.
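The context-drives-VRAM point can be made concrete: a transformer’s KV cache grows linearly with context length and sits in VRAM on top of the weights. A sketch assuming Llama-style 8B dimensions (32 layers, 8 KV heads, head dim 128, fp16 cache - all illustrative assumptions, not measured figures):

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, bytes_per_value: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * heads * head_dim * tokens * bytes."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value / 2**30

# KV cache for an 8k-token prompt, on top of ~4 GB of quantized weights (assumed):
kv = kv_cache_gib(layers=32, kv_heads=8, head_dim=128, context=8192)
print(f"KV cache at 8k context: ~{kv:.1f} GiB")
```

Doubling the context doubles the cache, and activations plus framework overhead come on top - which is why a 12–16G card fills up quickly once you add tools and home context.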

So I have a question about this.

I set up Home Assistant with the WallPanel app on several Android PoE touch panels. I set up Whisper as a quick voice test, left the app listening, and it was able to turn my lights on and off nearly as quickly as I finished the statement. Why does it take such a massive amount of hardware to do the same with something like Ollama?

What I am trying to do is replace Alexa. I want it to be local. Basically all I use Alexa for is to queue my playlists from Amazon Music and for voice control of lights.

All lights are Ethernet to Shelly devices. Audio is via Ethernet to a WiiM Amp Pro to wired speakers. All touch panels are PoE and have microphones. I want to be able to say to the touch panel “wake command, play my Amazon Music playlist [playlist name]” and have it start playing. If I want individual songs, I can look those up myself and just play them directly. The rest of the commands are to turn lights on or off, or dim them.

So in order to do this I need a $2,000–3,000 PC?

For LLM, yes.

Let me be VERY clear: if you want to drive your home with a local LLM, very much yes. You need capable gear. Capable gear means real inference hardware - a real GPU/NPU with real VRAM… But…

Voice != LLM

Speech-to-phrase is designed to just answer voice commands. Think Alexa pre-LLM.

LLMs answer a completely different issue - proactive, agentic management of your home.

So speech-to-text would only need a database of commands that it matches against in order to complete a task, and do so quickly, correct? So as long as I don’t change my routine much or add many new things, I could run this on my VM HA server and it would be instantaneous?

That is what I don’t find much support for: how to do that. I don’t know how to force something to open my Amazon Music account, or even access it through the WiiM Amp’s built-in Amazon Music integration, in order to complete that basic command.

Near instant.

It uses basically the same infrastructure for tooling. It’s not nearly as forgiving, but it’ll get you “turn on this light, that light, or the lock.”

LLMs give human-like understanding of intent and the ability for the computer to handle ambiguity and answer extended questions.

No llm: turn on the kitchen light.

Capable LLM: Friday tell me what we have in the fridge for dinner… (yes, really)
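The no-LLM path above is, at its core, template matching against a fixed set of phrases, which is why it runs near-instantly on modest hardware - there are no model weights to stream. A minimal sketch of the idea (the templates and action names here are made up for illustration; Home Assistant’s real intent system is more elaborate):

```python
import re

# Hypothetical command templates mapped to action identifiers.
TEMPLATES = [
    (re.compile(r"turn (on|off) the (.+)"), "light.turn_{0}"),
    (re.compile(r"play my (.+) playlist"), "media.play_playlist"),
]

def match_command(text: str):
    """Return (action, captured groups) for the first matching template."""
    for pattern, action in TEMPLATES:
        m = pattern.fullmatch(text.lower().strip())
        if m:
            return action.format(*m.groups()), m.groups()
    return None, ()

print(match_command("Turn on the kitchen light"))
# ('light.turn_on', ('on', 'kitchen light'))
```

Anything that doesn’t match a template simply fails, whereas an LLM can recover the intent from an ambiguous phrasing - that forgiveness is what you’re paying all the VRAM for.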

See, this is why I want to abandon Alexa, or any assistant, because of the push for AI. I don’t need that. I designed my home, as I physically built it, to be completely autonomous. The rest is controllable by simple speech-to-text; what I don’t know is how to translate that into the more complex task of interacting with external accounts to complete a specific action (e.g. play my Amazon Music playlist “party” on the WiiM Amp Pro).


Read about Music Assistant. They have a rather extensive list of voice intents for speech-to-phrase for media.

I have; they don’t have support for Amazon Music.

I am actually pretty dumb. But talking with you, I realized that I can use desktop Ollama to help me build a database that I can use in Home Assistant to do what I want, and then apply it to my server and settings without having to integrate AI directly.

I did some of it here and am linking to that micro tutorial: WiiM product voice control through speech-to-phrase via api - Configuration / Voice Assistant - Home Assistant Community