I’ve got HomeAssistant installed on a powerful PC using the x64 image.
I’d like to make the voice recognition and answers more accurate and extensive by integrating an LLM (via Ollama, for instance).
I have seen all kinds of tutorials on how to connect an LLM running on some server in your LAN, but that requires yet another server to be powered on all day…
My question:
Is it possible to actually install and run the LLM on the same server that is running Home Assistant (x64), and if so, are there any tutorials for this?
Yes, it is. It depends on your setup, but if you’re already running Docker containers (or some such monstrosity), just put Ollama in a container and run it alongside HA.
But Ollama uses a lot of GPU and CPU, so count on that. Exactly how much depends on the model you run, but generally it’s heavy.
Since it can bog down or even crash your server, many people run it on a separate computer.
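As a rough illustration of that container approach, a docker-compose fragment along these lines would run Ollama next to an existing Home Assistant container. Service name, volume name, and image tag here are just examples; 11434 is Ollama’s default API port.

```yaml
# Illustrative compose fragment: Ollama alongside Home Assistant.
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"             # Ollama's default API port
    volumes:
      - ollama_data:/root/.ollama # persist downloaded models across restarts
    restart: unless-stopped

volumes:
  ollama_data:
```

You’d then point HA’s Ollama integration at `http://<host>:11434`.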
If the machine has the beef, you can run Proxmox, punch your GPU through to a VM running Ollama (the “separate” server), and run HA in another VM that calls the Ollama instance.
Just because a tutorial says “separate machine” doesn’t mean separate iron; a virtualized environment counts.
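Once that Ollama VM is up, HA (or anything else on the LAN) talks to it over plain HTTP. A minimal sketch of what that call looks like, assuming Ollama’s default port 11434 and its `/api/generate` endpoint; the hostname and model name are placeholders:

```python
import json
from urllib import request

# Placeholder hostname for the Ollama VM on your LAN.
OLLAMA_URL = "http://ollama-vm.lan:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    """Assemble the JSON body Ollama's /api/generate endpoint expects."""
    return {
        "model": model,    # e.g. "llama3.1:8b"
        "prompt": prompt,
        "stream": False,   # ask for one complete answer, not streamed chunks
    }

def ask(model: str, prompt: str) -> str:
    """Send the prompt to the Ollama VM and return the reply text."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# ask("llama3.1:8b", "Turn off the kitchen lights.")  # needs a running server
```

In practice you’d let the HA Ollama integration do this for you; the sketch just shows there is nothing magic about “separate server”, it’s one HTTP call.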
I’ve got Ollama installed on a Linux Mint machine with two GPUs (a 5070 Ti and a 3060 Ti, although only the 5070 gets used) and an AMD Ryzen 9 9900X.
That machine, however, as you can imagine, is quite hungry in the kWh department and is switched off when not in use…
I know a GPU would be best for speed, but HA is installed on a mini PC with an AMD Ryzen 5 7430U, 16 GB of DDR4 RAM, and a 2 TB Samsung 990 Pro.
Is it totally impossible to run without a GPU and still be usable? (I mean, I ran DeepSeek on my Samsung Android phone: it wasn’t fast, but it was running.)
UNEQUIVOCALLY YES. Like Commander Scott, I cannot change the laws of physics, captain… (The physics in this case is tons of tensor math, like a game engine.)
You MUST have a GPU capable of maintaining an 8k context or better (practical translation: 12 GB of VRAM).
You can’t run an LLM on CPU alone, sorry. It flat-out won’t work, and for HA control it needs to be a decent-sized LLM. I know people don’t want to hear it, but you can try, fail, get upset about it, and say it doesn’t work. In reality it needs beef.
What param count, what quantization, what context length?
LLM != LLM != LLM (one is not like the other)
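To make “LLM != LLM” concrete, here is a back-of-the-envelope estimate of weight size from parameter count and quantization. This is a rule of thumb only: real model files carry extra overhead for embeddings and metadata, and it ignores the context cache entirely.

```python
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough size of the model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

# An 8B model at Q4 (~4 bits/weight) vs FP16, and a 70B model at Q4:
print(weight_gb(8, 4))    # -> 4.0  (leaves a 12 GB card room for context)
print(weight_gb(8, 16))   # -> 16.0 (already too big for most consumer GPUs)
print(weight_gb(70, 4))   # -> 35.0 (out of reach for one consumer card)
```

Same “8B model”, wildly different hardware requirements depending on quantization; that’s why param count alone tells you almost nothing.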
For decent response times (measured in seconds, not eons) that 3090 is ACTUALLY better than your 5070: more VRAM means a bigger, smarter context, and yes, it matters. Take more VRAM every time within two card generations.
You could probably load a 24–32k context window on that thing with decent performance. Your setup isn’t much different from mine with Friday. Note I STILL cannot load her completely local and rely on a cloud model for the front-stage persona, and I drive an Intel A770 and a 5070 Ti on a NUC.
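Context length eats VRAM on top of the weights, via the KV cache. A hedged estimate below, using illustrative Llama-3-8B-style dimensions; the layer count, KV-head count, and head size all vary per model, so treat the numbers as order-of-magnitude only.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV-cache size in GiB: one K and one V vector per layer per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens / 2**30

# Illustrative dims: 32 layers, 8 KV heads (GQA), head dim 128, FP16 cache.
print(kv_cache_gib(32, 8, 128, 8_192))    # -> 1.0  (GiB at an 8k context)
print(kv_cache_gib(32, 8, 128, 32_768))   # -> 4.0  (GiB at a 32k context)
```

So on top of ~4–5 GB of Q4 weights, a 32k context costs roughly another 4 GiB on this class of model, which is why 12 GB of VRAM is a practical floor and 24 GB cards are comfortable.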
I’m on the long ramp to publicly deployable right now. The goal is an LLM framework 100% INSIDE HA, native, without add-ons. It provides the glue to solve the grandma’s-cardboard-box problem. (See my other posts; apologies for typos.)