Running LLM on HomeAssistant?

Hi,

I’ve got HomeAssistant installed on a powerful PC using the x64 image.

I would like to make the voice recognition and answers more accurate and extensive, so I would like to integrate an LLM (like Ollama, for instance).

I have seen all kinds of tutorials on how to connect an LLM running on some server in your LAN, but that requires yet another server to be powered on all day…

My question:
Is it possible to actually install and run the LLM on the server that is running HomeAssistant (x64), and if so, are there any tutorials for this?

Any input is appreciated!

Thanks!

Yes, it is. It depends on your setup, but if you're running Docker containers or some such monstrosity, just put it in a container and run it.
Keep in mind that Ollama uses a lot of GPU and CPU. How much depends on the model you're using, but generally it's heavy.
Since it can bring your server down, people tend to put it on another computer.
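Once a containerized Ollama is up on the same host, HA (or anything else) can reach it over its local REST API. A minimal sketch, assuming Ollama is listening on its default port 11434 and that a model (the name `llama3.2` here is just a placeholder) has already been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one complete JSON response, not a stream
    }).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def ask(model: str, prompt: str) -> str:
    """Send the prompt and return the reply (requires a running Ollama)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

In practice the official HA Ollama integration does this for you; the sketch just shows how little plumbing is between the two when they share a box.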


If the machine has the beef, you can run Proxmox, pass your GPU through to a VM running Ollama (the 'separate' server), and run HA in a separate VM that calls the Ollama instance.

Just because a tutorial says separate machine doesn't mean separate iron when you virtualize.

What's the iron? (actual specs of the big gamer PC)


I've got Ollama installed on a Linux Mint machine running two GPUs (a 5070 Ti and a 3060 Ti, although only the 5070 gets used) with an AMD Ryzen 9 9900X.

That machine, however, as you can imagine, is quite hungry in the kWh department and is switched off when not in use…

I know a GPU would be best for speed, but HA is installed on a MiniPC with an AMD Ryzen 5 7430U, 16 GB DDR4 RAM, and a 2 TB Samsung 990 Pro.

Is it totally impossible to run without a GPU and still be usable? (I mean, I ran DeepSeek on my Samsung Android phone: it wasn't fast, but it was running.)

UNEQUIVOCALLY YES. <<< Like Commander Scott, I cannot change the laws of physics, captain… Yes. (Physics in this case is TONS of tensor math, like a game engine…)

You MUST have a GPU capable of maintaining an 8k context or better (practical translation: 12 GB VRAM).

You can't LLM on CPU, sorry. It flat-out won't work, and for HA control it needs to be a decent-sized LLM. I know people don't want to hear it, but you can try, you can fail, get upset about it and say it doesn't work. But in reality it needs beef. :wink:

What param count, what quantization, what context length?

LLM != LLM != LLM (one is not like the other)
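Those three knobs are what drive the VRAM requirement. A rough back-of-envelope sketch (the layer and hidden-dimension numbers are illustrative, loosely Llama-3-8B-class; the naive formula below ignores grouped-query attention, which shrinks real KV caches considerably):

```python
def model_vram_gb(params_b: float, bits_per_weight: float) -> float:
    """Rough VRAM for the weights alone: parameters x bits, no overhead."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, hidden: int, context: int,
                bytes_per_val: int = 2) -> float:
    """Naive KV-cache size: 2 (K and V) x layers x hidden x context x bytes."""
    return 2 * layers * hidden * context * bytes_per_val / 1e9

# Example: an 8B-parameter model at 4-bit quantization with an 8k context
weights = model_vram_gb(8, 4)        # 4.0 GB of weights
cache = kv_cache_gb(32, 4096, 8192)  # ~4.3 GB of fp16 KV cache
```

Weights plus cache already crowd an 8 GB card before any framework overhead, which is roughly where the "practical translation: 12 GB VRAM" rule of thumb comes from.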

For decent response times (meaning measured in seconds, not eons) that 3090 is ACTUALLY better than your 5070 (more VRAM, bigger and smarter context, and yes, it matters: more VRAM wins every time within two card generations…).

You could probably load a 24-32k context window on that thing with decent performance. You're not set up much differently than I am with Friday. Note I STILL cannot run her completely locally and rely on a cloud model for the front-stage persona, and I drive an Intel A770 and a 5070 Ti on a NUC.

Hmmm ya, I was afraid of that…

Guess I could write an automation that boots the dual-GPU machine and shuts it down when the LLM isn't being queried…
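HA has a built-in wake_on_lan integration that can do the boot side of that automation; the mechanism underneath is just a broadcast "magic packet". A minimal sketch (the MAC address is a placeholder for the GPU machine's NIC):

```python
import socket

def magic_packet(mac: str) -> bytes:
    """Wake-on-LAN magic packet: 6 x 0xFF, then the MAC repeated 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("MAC address must be 6 bytes")
    return b"\xff" * 6 + raw * 16

def wake(mac: str, broadcast: str = "255.255.255.255") -> None:
    """Broadcast the packet on UDP port 9, the usual WoL port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, 9))
```

Shutting the machine down again is the easy half: a shell_command over SSH, or an agent on the GPU box that powers off after some idle timeout, both work.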

Thanks for your help, appreciated!


If you set the machine up correctly it will idle down. The NUC driving my A770 only uses 35 W at idle and spins up to, I think, as much as 300 W…

Typo in my first post: it's a 3060, not a 3090 :laughing: but its memory is nice to offload tasks to while the 5070 is busy.

What is Friday ?

That…

I'm on the long ramp to publicly deployable right now. The goal is an LLM framework 100% INSIDE HA, native, without add-ons. It provides the glue to solve the grandma's-cardboard-box problem. (See posts; apologies for typos.)

Hi Nathan,

Interesting, I’ll be sure to follow that!

Thanks again!