Local LLM hardware

Hi community,

I have the HA Voice Preview Edition with Whisper STT running on an Intel N5105, with very poor performance. Basically it’s pretty much unusable. So I’m thinking of upgrading my HA server to something a bit more decent that can also run a local LLM. I want the full Jarvis experience. :slight_smile:

Problem is, I don’t want to pay the electricity bills for it. I see there are a few Copilot+ NUCs recently introduced (mainly around CES), like for example the ASUS NUC 14 Pro AI. Basically my question here is, is this gonna be enough for the experience I want to achieve?

What do you use? Are you happy with it? What would you recommend? Anything would be of help here, I’m literally at ground zero. And I believe a lot of people are at this point as well now, so it would be a potentially interesting thread.


A NUC 14 Pro AI runs at approximately 100 TOPS, split between GPU, NPU, and CPU processing.

For context…

That’s BARELY enough to run Llama3:8b (Llama3 8b-class models are typical for local AI running HA; I can currently run it successfully on a NUC 14 AI, but with challenges), and only if you jump through hoops for the Intel Arc architecture based installation. If you don’t do this often: most everyone uses Nvidia iron, so Intel is an odd setup using Intel IPEX and more complex. Doable, but complex.

So for full Jarvis: you’re just BARELY on the leading edge of capable hardware. Doable, but it will require care and feeding.
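
If you want to sanity-check a setup like that before committing to it, a minimal smoke test is just to time a short generation against the local Ollama endpoint. This is a sketch assuming Ollama (the IPEX-LLM build or otherwise) is already serving llama3:8b on the default port; adjust the host and model name to your install.

```python
# Rough smoke test for a local Ollama install (e.g. an IPEX-LLM build on Arc).
# Assumes Ollama is listening on the default port and llama3:8b is pulled.
import requests

HOST = "http://localhost:11434"
MODEL = "llama3:8b"

resp = requests.post(
    f"{HOST}/api/generate",
    json={"model": MODEL, "prompt": "Say 'ready' and nothing else.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
data = resp.json()

# eval_count / eval_duration (ns) is Ollama's own token throughput accounting.
tps = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"reply: {data['response']!r}")
print(f"throughput: {tps:.1f} tokens/s")
```

Single-digit tokens per second is roughly where a voice assistant starts to feel sluggish, which (in my reading) is the “barely enough” territory described above.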

Thanks for the quick reply, @NathanCu. What kind of challenges and “care and feeding” are we speaking about here? I understand that kind of horsepower just touches the surface, but in the end, if the next best thing is a $3K GPU setup and “divorce level” electricity bills, I’ll take it.

Short version: I have to drastically cut back my expectations of what I can make it do.

Biggest problem… To be successful I had to get into limiting the number of entities I can control and limiting how much context I can stuff into the prompt. You’re also working with a very limited chat context window. I may be able to do something about that part with settings, but in short: if you have a large, complex install, going local with that hardware will be difficult.
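
For illustration, “limiting entities and context” can be as simple as hand-picking which entities go into the prompt and capping the chat history. The helper and entity names below are made up for the sketch, not HA’s actual API.

```python
# Illustrative only: one way to keep the prompt small on constrained hardware.
EXPOSED_ENTITIES = [          # hand-picked subset instead of "everything"
    "light.living_room",
    "climate.hallway",
    "lock.front_door",
]
MAX_HISTORY_TURNS = 4         # tiny chat window so it fits the context

def build_prompt(states: dict, history: list[str], user_text: str) -> str:
    context = "\n".join(f"{e}: {states.get(e, 'unknown')}" for e in EXPOSED_ENTITIES)
    recent = "\n".join(history[-MAX_HISTORY_TURNS:])
    return (
        "You control only these entities:\n"
        f"{context}\n\nRecent conversation:\n{recent}\n\nUser: {user_text}"
    )
```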


I don’t know much about the topic, but isn’t Whisper STT an ASR which requires less computational power than an LLM?

A few numbers here.
I run a small LLM on a dedicated PC with a 12 GB VRAM GPU. I can get acceptable responses within a 3-9 second time frame.
The PC draws 54 W idle but peaks at 250-300 W when processing a request.

Does it work? Yes.
Now for me that idle power consumption is far too high for a few requests a day, and I am looking/waiting for a better solution. The peak only lasts 15 seconds or so, so that doesn’t worry me much.
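
For a rough sense of what that idle draw costs, here’s the back-of-the-envelope math (the electricity price is an assumed example; plug in your own tariff):

```python
# Back-of-the-envelope idle cost for the numbers above.
IDLE_W = 54
PRICE_PER_KWH = 0.30          # assumed EUR/kWh, adjust to your tariff

kwh_per_year = IDLE_W / 1000 * 24 * 365        # ~473 kWh
print(f"{kwh_per_year:.0f} kWh/yr ≈ {kwh_per_year * PRICE_PER_KWH:.0f} EUR/yr just idling")
```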


Some more information about GPU power consumption. The chart is for Windows; on a server without a GUI you can expect slightly lower values.

Yes, it is. It does nothing more than take voice and convert it to text. That’s the light lift here; as long as you have enough horsepower to do it, it’s lighter than the LLM requirements.

Then you have two choices. You can let the HA voice intent system take a stab at matching the request (the part the OP is disappointed with, and I agree: it’s pattern matching and it’s cumbersome).

Or you can hand that text to an LLM to interpret and then do the things. That’s what an LLM does well… if you can give it enough context and enough horsepower.
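
A minimal sketch of those two paths: try cheap pattern matching first and only fall back to a local LLM when it misses. The intent table and prompt below are illustrative, not Home Assistant’s actual intent engine.

```python
# Sketch of the two paths described above (names are illustrative, not HA's API):
# try pattern matching first, hand the text to the LLM only when it misses.
import re
import requests

INTENTS = {
    r"turn (on|off) the (.+)": "HassTurnOnOff",   # toy pattern table
}

def handle(text: str) -> str:
    for pattern, intent in INTENTS.items():
        if re.fullmatch(pattern, text.lower()):
            return f"matched intent {intent}"      # fast path, no LLM needed
    # slow path: let a local LLM interpret the request
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3:8b", "stream": False,
              "prompt": f"Interpret this smart-home request: {text}"},
        timeout=60,
    )
    return r.json()["response"]
```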


Will the NUC 14 Pro improve STT as well? Whisper is SO disappointing in my native language.

If I were building a machine for hardware-accelerated AI use right now, I would use Nvidia-based hardware. Why? Because yes, it’s a cheap mini PC, but you’ll spend most of your time getting it to do your work: the Intel-specific parts of this still need (a lot of) work, while Nvidia, being the big dog, just works.

My rig is totally experimental because I do this sort of thing for a paycheck. The Intel gear requires a LOT of extra work to get AI acceleration through the Intel IPEX stack on an Alchemist or Battlemage card. It totally WORKS, but unless you’re a tinkerer, or Intel makes IPEX WAY easier, I wouldn’t recommend it right now. Everything is built for the Nvidia CUDA architecture at the moment. Yes, Whisper/Piper is not this, but let’s face it: if you buy the box, you’re going to try AI here too.
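
One quick way to see the CUDA-first reality for yourself is to ask PyTorch which accelerator backends it can actually use; on a stock install, usually only the CUDA check comes back true, while the Intel XPU path needs a special build. This is just a diagnostic sketch:

```python
# Check which acceleration backends this PyTorch build actually exposes.
import torch

print("CUDA available:", torch.cuda.is_available())
print("XPU (Intel) available:",
      hasattr(torch, "xpu") and torch.xpu.is_available())   # needs an XPU-enabled build
print("MPS (Apple) available:", torch.backends.mps.is_available())
```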

Ultimately I plan on running the local AI stack on Nvidia DIGITS hardware.

So tl;dr: unless you’re a pro, or otherwise have a reason to go another way, I’d stick to the Nvidia stack for AI stuff right now.

Well, $3K for an NVIDIA DIGITS might not be affordable… spending $20-40 a month on ChatGPT Plus would be cheaper for more than 6 years… (without even considering the electricity bill).
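
The break-even math, taking those numbers at face value (hardware assumed at $3,000, electricity ignored as noted):

```python
# How long the subscription would have to run before the hardware pays for itself.
HW_COST = 3000
for monthly in (20, 40):
    months = HW_COST / monthly
    print(f"${monthly}/mo: {months:.0f} months ≈ {months / 12:.1f} years to break even")
```

So at $40/month the hardware takes over six years to pay for itself, and at $20/month well over a decade.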


This is very true, and…

I can then run DIGITS flat out, continually processing EVERYTHING, whereas I’m being VERY stingy with my tokens at the present time.

The operative point being: as far as usefulness for me goes, a machine capable of running a 100b model unquantized is probably as close to ‘Jarvis’ as anything I’ve ever expected to achieve. To do that with tokens now would cost me well over $100/month.

With capable local hardware I don’t care anymore; for a lot of the things I used to write software for, I would instead send EVERYTHING to the AI to analyze in the background and have the interactive agent just summarize those states. Hard to describe, but it changes the usage pattern into a very different workload. More churn and burn than transactional.

Right now the ChatGPT and cloud model economy favors “how much work can I do within this relatively small budget per time scale”. Local (even expensive) favors pushing as much through the pipe as fast and hard as you can, all the time. Subtle? Sure. You want AI? Yeah, I need this proc at 80% all the time or I overpaid.

In my current plan it’ll be at 80% or better, and I may need the feature for tethering to a second one (I really hope to avoid it).

I think long term it’s really something like an HX370-based mini PC + DIGITS that gets a lot of heavy long-term use through… 2030ish? Basically $3,000-3,500 USD, depending on your best deal in the target time frame, for probably 5 years of use? Heck yes, sign me up. Mayyyybe a booster to the mini PC at 3 years for another 500 bucks…

I’ve seen people pay that for just the watercooling, and I GUARANTEE I’ll pump way more tokens through than I could pay for in the same time period with OpenAI.

There’s already enough ML pattern data from companies like OTIS to be used for failure prediction on resistive loads.

Suddenly you feed in that ML model data and your electrical usage data, tell the LLM that it can get all that with these three tags… and run it hourly…

I’d blow the doors off my tokens at OpenAI. But running that locally? It’ll tell me the refrigerator is about to fail months before anyone would know anything is wrong…
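
As a sketch of that “run it hourly” pattern: pull recent power readings, hand them to the local model, repeat. The data source and readings below are hypothetical placeholders; only the Ollama endpoint is a real default.

```python
# Illustrative "churn and burn" loop: every hour, dump recent power readings
# into a local model and print the summary. Sensor data here is made up.
import time
import requests

def fetch_power_history() -> list[float]:
    # stand-in for pulling e.g. a fridge power sensor's history from HA
    return [85.0, 87.5, 86.1, 132.4, 140.9]   # watts, made-up values

def analyze(readings: list[float]) -> str:
    prompt = (
        "Hourly appliance power readings (W): "
        f"{readings}. Flag anything that looks like early compressor failure."
    )
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3:8b", "prompt": prompt, "stream": False},
        timeout=120,
    )
    return r.json()["response"]

while True:
    print(analyze(fetch_power_history()))
    time.sleep(3600)   # on local hardware, running this every hour costs nothing extra
```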


So if that ASUS NUC 14 Pro AI is not sufficient, what would be an off-the-shelf (small/silent) PC that is suitable to run AI for the HASS voice assistant locally (and HASS itself, with a bunch of other add-ons)?

I really don’t have the time anymore to gather components and build one myself.

Thanks!

I would suggest finding a mini PC with at least a 3060 with 12GB VRAM, if possible.

Check out the Mac mini reviews. Different versions of the SoC give different performance, but even the standard M4 shows tolerable performance for small (7B-14B) models.
These products probably have the best idle power consumption to date.

More. As much VRAM as you can afford. (This goes in all cases)

It’s not that simple - a ‘one size fits all, just buy it and put it in place right now’ option doesn’t exist. Let’s start there.

The Nvidia DIGITS - when you can get it - will be the closest thing to that that exists.

First you have to consider workload.

(run this on a shoe box)
2-4b models: TINY - good for a simple yes/no answer - fast and responsive but not doing a lot. Runs on just about anything. Voice processing basically fits in this space too. You might be able to brute force this if you’ve got enough CPU, or a moderate GPU makes this cake.

(sweet spot for the IPEX here)
7-8b models: Workhorse. Without a personality and without ambiguous tasks, this can handle MOST of the summarization and JSON generation tasks you throw at it (see the JSON sketch after this list). Various models here do different things - I plan on loading a selection of models to be able to handle different tasks. Honestly, MOST real home automation tasks will probably live in this space, EXCEPT…

(you really need bigger iron than the IPEX here)
16-32b models: Heavy hitter - either big number crunching, or you need a (believable) personality.
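
As a concrete example of the JSON-generation work the 7-8b tier is good at, here is a minimal call using Ollama’s JSON output mode; the model choice and schema are just examples.

```python
# Example of a 7-8b "workhorse" task: force structured JSON output from a
# small local model via Ollama's JSON mode.
import json
import requests

prompt = (
    "Summarize as JSON with keys 'room' and 'action': "
    "'the kitchen lights have been on for six hours'"
)
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:8b", "prompt": prompt, "format": "json", "stream": False},
    timeout=60,
)
print(json.loads(r.json()["response"]))   # e.g. {"room": "kitchen", "action": ...}
```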

I can do everything up to roughly 16 GB models on the NUC with Intel IPEX Ollama.

Once you start needing a 20+ GB model, things get memory constrained and swapping happens - slowing everything down. With more RAM it’s doable - the box has over 100 TOPS - but right now I’m RAM constrained.
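
A rough rule of thumb for why the 20+ GB wall shows up where it does: estimate memory as parameter count times bytes per weight, plus some runtime overhead. These are ballpark figures, not exact file sizes.

```python
# Ballpark memory footprint for quantized models: params (billions) * bits/8,
# padded ~20% for KV cache and runtime overhead.
def est_gb(params_b: float, bits: int = 4, overhead: float = 1.2) -> float:
    return params_b * bits / 8 * overhead

for size in (4, 8, 16, 32, 70):
    print(f"{size:>3}b @ Q4 ≈ {est_gb(size):.1f} GB, @ Q8 ≈ {est_gb(size, 8):.1f} GB")
```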

So what I’m doing with Friday this week is splitting out the summarizer to be able to run small bits repeatedly, very quickly - and direct THOSE to the local model.

Interactive is still GPT, but now it’s ONLY doing the interactive prompt, and model use drops from ~$150/mo to $5.
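
A sketch of that split: the high-volume summarization goes to the local model, and only the interactive prompt hits the cloud. Both are called through OpenAI-style chat endpoints; the URLs, model names, and key handling are placeholders to adapt to your setup.

```python
# Route cheap, repetitive summarization to the local model; keep the
# interactive prompt on a cloud model. Endpoints/models are placeholders.
import os
import requests

LOCAL = "http://localhost:11434/v1/chat/completions"     # Ollama's OpenAI-compatible API
CLOUD = "https://api.openai.com/v1/chat/completions"

def chat(url: str, model: str, text: str, key: str | None = None) -> str:
    headers = {"Authorization": f"Bearer {key}"} if key else {}
    r = requests.post(url, headers=headers, timeout=120,
                      json={"model": model, "messages": [{"role": "user", "content": text}]})
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def summarize(text: str) -> str:                   # high-volume background work -> local
    return chat(LOCAL, "llama3:8b", f"Summarize briefly: {text}")

def interactive(text: str) -> str:                 # the one prompt that still earns GPT
    return chat(CLOUD, "gpt-4o-mini", text, key=os.environ["OPENAI_API_KEY"])
```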

A single RTX 3060 with 12 GB VRAM is the minimum I would suggest; quad 3090s with 24 GB VRAM each (96 GB total) would be better, but tricky to assemble.


Any hope of building midrange under $400 USD in the future?

The GPU alone will cost that.

That’s why I ask.
Wondering if this is the long-term plan – brute force – or if anything is in the works that will be both capable and less expensive.