Well - the good news is, crzynik's PR to add chat_template_kwargs support for AI Tasks is part of the latest Local OpenAI LLM integration release. I've pulled the update and it does exactly what it says. Thanks!
The moderate-to-bad news is, I now think I've narrowed down the actual issue with Gemma 4 here. (Maybe this is also impacting other models - I haven't really branched out because of my setup. I can only definitively say what I'm seeing on llama-server + unsloth/gemma-4-E4B-it-Q4_K_M.)
The problem is: Enabling "Structured output" on the request seems to cause the server to bypass or ignore the enable_thinking: false flag, whether it's set via chat_template_kwargs or via --reasoning off in the server config. If I disable Structured Output, it respects the no-thinking flag; if I enable it, it thinks for as long as it wants, and only --reasoning-budget on the server config can stop it.
Bit of a setback, since my current candidates for "fast" and "slow" tasks both really benefit from the structured output requirement, and the slow one definitely underperforms if I don't allow it to reason first. Might have to see if I can work around it by just prompting the "fast" task really hard.
Sure. In a Home Assistant script or automation, you can add a "Generate data" action that invokes an AI Task entity and returns the output. In addition to the prompt telling the AI what to generate, you can toggle on the "Structured Data" setting and include a Home Assistant-style YAML description of what exact output fields you expect.
In this example, the prompt is giving the LLM a spoken list of items and the spoken name of a to-do list, along with a list of all of the possible to-do list entities it might match, and telling the LLM to reformat and clean the item list to a particular standard, and identify the most likely entity match.
Enabling "structured output" on the request engages some mechanisms on the server that coerce the model into only outputting syntactically-sound results, e.g. only valid JSON (I think llama.cpp's implementation is called GBNF?) Most likely the model itself is generating JSON and Home Assistant is just translating it to/from YAML. But the result is that the data returned to Home Assistant from the "Generate data" task is a variable with typed keys and values that you can use just like any other automation variable.
You can try enabling debug logs so we can confirm, but it looks like everything should be working on Frigate side. Perhaps the llama-server is ignoring the kwargs when that is set.
Yeah, I suspect this is probably an issue on the llama-server side. I could definitely see that the kwargs were getting sent in both cases. I doubt I'll have enough time to dig into it much further right now. Thanks again for all of your help so far!
When I tested Qwen3 ASR originally I kept getting language None<asr_text> and it just kept seeming to work oddly. I tried again today and it looks like that needs to be parsed out on the user side, but running on saved .wav that i have Qwen3ASR 1.7B is doing even better than Gemma4 E4B.
Can I ask a question on the hardware end here? If I wanted to run local, is there anything fast enough in the $500 range? I see that @NathanCu mentioned he has a 5070ti, but it looks like that's going for ~$800 on Ebay. Is that worth it?
My workload would likely be (low latency) Home Assistant voice and maybe Frigate (although it runs just fine on OpenVINO, I think). I'm also running Hermes right now, so it would be great if I could add that locally too.
I documented a number of different GPUs above which I tested my setup on. It really just depends what model you want to run, even a lower cost 8GB GPU can do modest tasks effectively these days, but the more complex tasks may fall short or not be reliable.
You can get an AMD RX 9060XT 16GB for $400 these days which will run very well on llama.cpp using the Vulkan backend.
The P100 memory bandwidth looks quite good on paper, but the tokens/s is lower than expected as it likely has some bottlenecks. That said, I would definitely try running GPT-OSS:20B on it as it looks like performance should still be plenty usable, at least enough to give it a try. You'll definitely want to look at my tips above to make sure you're fully taking advantage of prompt processing.
I build a dual gpu machine thinking I would offload the transcoding for frigate on the machine. Unfortunately, the idle power draw jumped 100w with the GPUs. Granted, a single GPU may only be 50w. Instead, I pulled an unused AMD 6600H mini pc that uses less than 50W running full tilt. With the number of cameras, it was also using about 1/4 of my vram. I still plan on piping some screenshots or video or something to my llm, but I was disappointed trying to double dip.
This can easily be improved by lowering the max clock speed and max memory speed on the GPU. When I used to run on a 3070 for Frigate this took the power usage from 50w down to 25w.
And that is not exclusive to ML / decode. It applies to most LLM workloads too. I reduce the power limit of my 9060XT which runs image generation from 250w down to 175w and it is actually faster with the lower power limit likely due to less throttling.
i'm having this issue, both with home control and assist, both ollama and "openai local llm", where it tries to "turn on" automations to trigger their actions, and it doesn't work. did anyone solve this or do i just have to move everything to scripts :\