I have been trying Gemma 4, both the E4B and 26B, and both with good results. The 26B is a little better. But I had a very different experience with the unsloth Gemmas, so I’m using bartowski now.
Maybe things have changed recently though. I have not tested them lately since I can’t fit them + whisper when I’m coding with Qwen 122B A10B, which is a problem, to say the least. I hope I can fit the E4B when I have the time, if I squeeze the memory setup a little. If I can, then I’ll script it, so 26B will load when I’m using smaller models (Qwen Coder Next 80B etc), and then E4B when needed.
Does anyone know why my agent is ignoring the “Process commands locally” setting?
I am asking the agent questions that should be processed perfectly by the built-in Assist intent, but my LLM agent is not processing them, and is instead spinning for 30+ seconds only to give me a verbose and incorrect response.
Debug shows it didn’t even try to process locally at all, even though it clearly says “prefer handling locally.”
Meanwhile, I created another agent that didn’t have an LLM attached and it processed the exact same command correctly in 0.08 seconds.
Is it feasible to run a local AI on just a CPU?
If not, how well would an NVIDIA Quadro K620 do, or a Coral TPU? Yeah, I know, it’s old.
Any recommendations as to which local model is best for fast responses, primarily for voice assist? I already have Whisper + Piper + OWW/MWW set up & trained.
Has anyone experienced issues with reaching the limits of the Google Gemma models? They appear to have helpfully loose rate limits.
These are already answered in the thread many times.
On the alternatives to a GPU (CPU only, the K620, the Coral)… No. Also no, and also no.
You need a GPU/NPU/XPU with enough VRAM to run your chosen model and context. Most homes will require a context of 8-12k tokens, which roughly translates to at LEAST 8 GB of VRAM on that card, but probably more. So… a current GPU (last couple of years, probably an Nvidia RTX 3000 series or better).
Qwen3.5/3.6 and Gemma4 are the current kings of the local space. Newer models are actually performant at 4-bit quantization levels. Your chosen model must support tool use and should support ‘reasoning’.
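For a rough feel of why 4-bit quantization matters for fitting a model on a card, weight memory is approximately parameter count times bits per weight. This is only a back-of-the-envelope sketch: the 8-billion-parameter figure is a hypothetical example, and real GGUF files come out somewhat larger because of quantization scales and layers kept at higher precision.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB: params * bits / 8 bytes each."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical 8B-parameter model at three precisions.
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit: ~{weight_gib(8, bits):.1f} GiB of weights")
```

Whatever is left on the card after the weights is what you have for context (KV cache) and runtime overhead.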
Everyone hits limits. I personally do not recommend free cloud models - just think about what info you are sharing.
Probably worth a bug report for core. For what it’s worth, I have a PR up that adds filterability to the GetLiveContext tool, so with any luck that will be merged soon and your LLM response will be much faster and hopefully better.
I have started (not too long ago) with Ollama and HA. But now I want to improve things and get voice going too (eventually).
So the first step for me is to move away from Ollama to llama.cpp.
Got the docker going (using a 3090) and I have llama.cpp running. However I am not sure if my config.ini is correct/optimized for my 3090 and for gemma-4-26B-A4B-it-GGUF:Q4_K_M and I need some help please. This is how it is right now:
Please let me know the changes to the config.ini (if any) you would recommend for a 3090 and gemma-4-26B-A4B-it-GGUF:Q4_K_M model so I can get the best performance out of them.
So, ok, that model you chose will probably chew ~16 GB of that card’s VRAM at 128k context.
That’s a decently performant model and is quite good. And… there are some considerations.
You may want to look and see if you can run the E4B instead. Depending on the work.
That model you have is equivalent to the one I run on my DGX spark for coding. Great for coding. Not the fastest but great.
I didn’t see anything setting context at first (there it is, I missed it the first time, so ignore the next bit, but the default is still 256k). That model defaults to 256k context, so unless you tell it otherwise it’s going to fill as much space on that card as it can with KV cache.
These conspire against you to build a capable, big honkin’ context model on that card that… let’s just say won’t be the fastest. My bet is you’re probably at about 6-12 seconds to first token (TTFT), based on that 120000, and somewhere around 5 tok/s.
If you’re coding, again, not an issue.
But for a voice response endpoint, 12-24 seconds is agonizing. First lever: get a smaller model.
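To see how numbers like these add up for a voice reply, here is a minimal arithmetic sketch. It assumes total latency is just prompt processing (TTFT) plus token generation; the 12,000-token prompt, 1,500 tok/s prefill rate and 40-token reply are made-up illustration values, and only the 5 tok/s generation rate comes from the guess above.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tok_s: float, decode_tok_s: float):
    """Return (ttft, total) in seconds for one request.

    ttft  = time to process the whole prompt before the first token appears
    total = ttft + time to generate the reply tokens
    """
    ttft = prompt_tokens / prefill_tok_s
    total = ttft + output_tokens / decode_tok_s
    return ttft, total


if __name__ == "__main__":
    # Hypothetical values, except decode_tok_s=5 (the estimate in the post).
    ttft, total = response_seconds(12_000, 40, prefill_tok_s=1_500, decode_tok_s=5)
    print(f"TTFT {ttft:.1f}s, full reply after {total:.1f}s")  # TTFT 8.0s, full reply after 16.0s
```

Both terms shrink with a smaller model, and the first one also shrinks with a shorter prompt.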
The E4B is VERY good for its size and significantly reduces memory load. You don’t need coder smarts for a frontline tool user. You can call out to that if you need it, and by reducing the memory load on the card you get faster responses. Everything else being equal, I find the E4B is faster than its bigger brother, even quantized down (the Q4_K_whatever part of that model name). Basically, you don’t need the biggest model for HA, you need the ‘right’ one. The right one for HA is one built for ‘agentic tool use’ like Qwen3.5 or Gemma4. You’re right there; I’m just not so sure you have the right size yet.
Then, once you pick the model, make the context size (the CTX param) as small as your largest workload needs. While a long context means better connections and a ‘smarter’ model, your memory use (the key-value cache, or KV cache) grows roughly linearly with it, and TTFT goes up too. Which directly == slower.
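A quick way to see the memory side of this is the standard KV cache size estimate: keys plus values, per layer, per KV head, per token. The layer count, head count and head size below are hypothetical placeholders, not the real Gemma or Qwen dimensions, and the formula assumes every layer uses full attention with a 16-bit cache (models with sliding-window layers or a quantized cache need less).

```python
def kv_cache_gib(ctx_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GiB for one sequence.

    The leading 2 accounts for storing both keys and values.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head size 128.
    for ctx in (8_192, 32_768, 131_072):
        print(f"{ctx:>7} tokens: {kv_cache_gib(ctx, 32, 8, 128):.1f} GiB")
```

With those placeholder dimensions, 8k of context costs about 1 GiB and 128k costs about 16 GiB, which is why trimming context frees up so much of the card.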
If it were mine, I’d identify how big of a context I really needed (you can look at the server and it will tell you how big your request was) and set up my inference service to support that + a reasonable growth number, to get speed as fast as possible.
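That sizing step can be sketched as below. The function name, the 25% growth margin, the 512-token reply allowance and the sample prompt sizes are all assumptions for illustration; you would plug in the request sizes your own server reports and pass the result to llama.cpp as the context size (`-c` / `--ctx-size`).

```python
import math


def pick_ctx_size(observed_prompt_tokens, max_reply_tokens=512,
                  growth=1.25, step=1024):
    """Pick a context size from observed request sizes.

    Takes the largest prompt seen, adds room for the reply, applies a
    growth margin, and rounds up to the next multiple of `step`.
    """
    peak = max(observed_prompt_tokens) + max_reply_tokens
    return math.ceil(peak * growth / step) * step


if __name__ == "__main__":
    # Hypothetical prompt sizes read from the inference server logs.
    print(pick_ctx_size([6_200, 7_100, 6_800]))  # 10240
```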
Thanks for your reply. I am using this model with LLM Vision too so it can analyze pictures/live feeds from my cameras. I will have to see if E4B can handle that too.
In regards to the config.ini - as I am really a beginner at this and don’t have much idea what to put in there or how to go about it, I took whatever @crzynik put in his and hoped for the best.
I have found something interesting, I believe this has come as part of the 2026.5 update but it is difficult to say for sure.
It appears that now, device control does not require the LLM to provide an area as it seems to now pull the area directly from the satellite. For example, this worked:
- tool_name: home-control__HassMediaSearchAndPlay
  tool_args:
    search_query: jazz music
  id: IaFGGMjywReWibQ6f3lN8UTKXeR733YD
  external: false
Whereas previously this would give me a failure, IntentMatchError or something like that. This makes me think there is some possibility that the area clause could be removed, so that the prompt no longer has to be dynamic per speaker. I am testing this now with the override functionality in the llm_intents customize_assist branch and will see how it goes.
edit: looks like there is still some difficulty with basic device control, as there is no generic way to just target the current area for those. It is mainly other commands like HassMediaSearchAndPlay which don’t need an area.
Just wanted to say thanks to crzynik and everyone else contributing to this thread (and crzynik, thanks in particular for regularly updating the top-level post!) By referencing the material here, I was able to get my handful of Voice units running fully locally, with Whisper/Piper running on the Home Assistant box itself (an old laptop), and a separate machine hosting a quantized Gemma 4 E4B on an old Nvidia RTX 2060 - taking up a mere ~5 of the card's 6 GB VRAM.
The speed is totally reasonable - once the cache is warmed up, most queries only take ~2 seconds of Whisper processing and 2-3 seconds of inferencing before I get a response back. It's not the brightest model, but it seems plenty capable for basic work so far, and it's genuinely a little magical to show off interactions like this:
Hey Jarvis, is there any fruit on the shopping list?
Yes, there are bananas on the shopping list.
Hey Jarvis, what about vegetables?
No, there are no vegetables on the shopping list.
Hey Jarvis, go ahead and throw broccoli on there.
I've added broccoli to the shopping list.
I've even turned off "Prefer handling commands locally" because the Gemma4 handler is better at handling multi-item shopping list requests like "Add bread, milk, and apples to Shopping". This was a QoL feature we missed after pulling the plug on Alexa - the native Home Assistant intent handler will add a single "bread milk and apples" item to the list, but (at least with a little bit of prompt coaching) the LLM will add them each individually. (Home Assistant's ability to route to different assist handlers based on wake word is also lovely here - we still have Okay Nabu if we ever need to fall back to the basic Home Assistant Cloud offerings.)
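To illustrate the difference between adding one combined item and adding each item separately, here is a naive splitter. It is only an illustration and not how Home Assistant or the LLM handles this; `split_items` is a made-up helper.

```python
import re


def split_items(spoken: str) -> list[str]:
    """Split a spoken list on commas and the word 'and'."""
    parts = re.split(r",|\band\b", spoken)
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    print(split_items("bread, milk, and apples"))  # ['bread', 'milk', 'apples']
    # Speech-to-text output often has no commas, and then this breaks down:
    print(split_items("bread milk and apples"))    # ['bread milk', 'apples']
```

The second case is where a rule like this falls over and the LLM, which can tell that "bread milk" is two items, earns its keep.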
Still playing around to find the rough edges - e.g. it needed a reminder to only count light entities and ignore switches when I ask it to turn off the lights - but it's really something else. Would not have guessed even a couple of months ago that I could get something this capable running on hardware that limited.
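The "only lights, not switches" rule comes down to the domain part of the entity ID, the bit before the dot. A minimal sketch of that filter, with made-up entity IDs (in my setup the actual fix was a line in the prompt, not code):

```python
def lights_only(entity_ids):
    """Keep only entities whose domain (the text before the dot) is 'light'."""
    return [e for e in entity_ids if e.split(".", 1)[0] == "light"]


if __name__ == "__main__":
    # Hypothetical entity IDs for illustration.
    entities = ["light.kitchen", "switch.coffee_maker", "light.hallway"]
    print(lights_only(entities))  # ['light.kitchen', 'light.hallway']
```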
In the near term, I might either try to rig up a way to reuse the same Gemma4 model for voice processing instead of running on Whisper, or finally migrate Home Assistant to another box (something that's been in the "todo" column for months now) and see if either of those can speed up the voice processing.
I'm personally eyeballing vllm-omni with Qwen3TTS. I just need a build of it that works on ARM for my Spark. Probably should be able to make it go on CUDA for your card.
Tell you what, the E4B model is pretty damn good. I had been running the A4B before this, but for my use case the E4B is nearly perfect: it does tool calls as expected and is fast.
@crzynik - thanks for the example. I have customized it and now have it running in HA. Tested with LLM Vision and I am happy with the performance. I might try E4B at a later stage, along with more llama.cpp customizations, but for now I am happy.
Next is getting the voice stuff done.
STT - looking at Wyoming ONNX ASR. What model are you using with this? (Hopefully I can make it fit on the GPU too.)
TTS - I think I will be using hass_local_openai_stt with Gemma4. Or is Kokoro better?