My Journey to a reliable and enjoyable locally hosted voice assistant

Well - the good news is, crzynik's PR to add chat_template_kwargs support for AI Tasks is part of the latest Local OpenAI LLM integration release. I've pulled the update and it does exactly what it says. Thanks!

The moderate-to-bad news is, I now think I've narrowed down the actual issue with Gemma 4 here. (Maybe this is also impacting other models - I haven't really branched out because of my setup. I can only definitively say what I'm seeing on llama-server + unsloth/gemma-4-E4B-it-Q4_K_M.)

The problem is: Enabling "Structured output" on the request seems to cause the server to bypass or ignore the enable_thinking: false flag, whether it's set via chat_template_kwargs or via --reasoning off in the server config. If I disable Structured Output, it respects the no-thinking flag; if I enable it, it thinks for as long as it wants, and only --reasoning-budget on the server config can stop it.

Bit of a setback, since my current candidates for "fast" and "slow" tasks both really benefit from the structured output requirement, and the slow one definitely underperforms if I don't allow it to reason first. Might have to see if I can work around it by just prompting the "fast" task really hard. :man_shrugging:

I’m not quite sure what you are referring to here, what is this and why are you needing to use it?

Sure. In a Home Assistant script or automation, you can add a "Generate data" action that invokes an AI Task entity and returns the output. In addition to the prompt telling the AI what to generate, you can toggle on the "Structured Data" setting and include a Home Assistant-style YAML description of what exact output fields you expect.


In this example, the prompt is giving the LLM a spoken list of items and the spoken name of a to-do list, along with a list of all of the possible to-do list entities it might match, and telling the LLM to reformat and clean the item list to a particular standard, and identify the most likely entity match.

Enabling "structured output" on the request engages some mechanisms on the server that coerce the model into only outputting syntactically-sound results, e.g. only valid JSON (I think llama.cpp's implementation is called GBNF?) Most likely the model itself is generating JSON and Home Assistant is just translating it to/from YAML. But the result is that the data returned to Home Assistant from the "Generate data" task is a variable with typed keys and values that you can use just like any other automation variable.

I see, I now remember seeing that but haven’t used it. That might be a bug in the code where perhaps it is not sending both, I will take a look.

You can try enabling debug logs so we can confirm, but it looks like everything should be working on Frigate side. Perhaps the llama-server is ignoring the kwargs when that is set.

Yeah, I suspect this is probably an issue on the llama-server side. I could definitely see that the kwargs were getting sent in both cases. I doubt I'll have enough time to dig into it much further right now. Thanks again for all of your help so far!

When I tested Qwen3 ASR originally I kept getting language None<asr_text> and it just kept seeming to work oddly. I tried again today and it looks like that needs to be parsed out on the user side, but running on saved .wav that i have Qwen3ASR 1.7B is doing even better than Gemma4 E4B.

1 Like

Can I ask a question on the hardware end here? If I wanted to run local, is there anything fast enough in the $500 range? I see that @NathanCu mentioned he has a 5070ti, but it looks like that's going for ~$800 on Ebay. Is that worth it?

My workload would likely be (low latency) Home Assistant voice and maybe Frigate (although it runs just fine on OpenVINO, I think). I'm also running Hermes right now, so it would be great if I could add that locally too.

1 Like

Realistically newer than a 3000 series. For LLM yojr biggest impact is VRAM. So at least 3000 series with 12G vram see anything in that range?

What are you running hermes on. That takes a SIGNIFICANT model.

I documented a number of different GPUs above which I tested my setup on. It really just depends what model you want to run, even a lower cost 8GB GPU can do modest tasks effectively these days, but the more complex tasks may fall short or not be reliable.

You can get an AMD RX 9060XT 16GB for $400 these days which will run very well on llama.cpp using the Vulkan backend.

3 Likes

I have a Tesla P100 16GB lying around unused and a Quadro P4000 8GB, im guessing these are a bit slow to be bothered with?

I had my LLM do some research and here is what it found:

Specification NVIDIA Tesla P100 NVIDIA Quadro P4000
Architecture Pascal (nvidia.com) Pascal (notebookcheck.net)
VRAM Capacity 16 GB HBM2 (nvidia.com) 8 GB GDDR5 (notebookcheck.net)
Memory Bandwidth ~732 GB/s (nvidia.com) ~192 GB/s (notebookcheck.net)
LLM Suitability Excellent for 7B–14B models Limited to small 7B models (quantized)

The P100 memory bandwidth looks quite good on paper, but the tokens/s is lower than expected as it likely has some bottlenecks. That said, I would definitely try running GPT-OSS:20B on it as it looks like performance should still be plenty usable, at least enough to give it a try. You'll definitely want to look at my tips above to make sure you're fully taking advantage of prompt processing.

I was getting that as well. I do find Gemma4 to be really good though with your integration

1 Like

did you ever solve the "language None<asr_text>" thing? i worry it would interfere with the native HA assist fallback to avoid calling the llm

According to documentation it's meant to be stripped out. The stt integration I created and linked in the OP handles this

\

I build a dual gpu machine thinking I would offload the transcoding for frigate on the machine. Unfortunately, the idle power draw jumped 100w with the GPUs. Granted, a single GPU may only be 50w. Instead, I pulled an unused AMD 6600H mini pc that uses less than 50W running full tilt. With the number of cameras, it was also using about 1/4 of my vram. I still plan on piping some screenshots or video or something to my llm, but I was disappointed trying to double dip.

This can easily be improved by lowering the max clock speed and max memory speed on the GPU. When I used to run on a 3070 for Frigate this took the power usage from 50w down to 25w.

And that is not exclusive to ML / decode. It applies to most LLM workloads too. I reduce the power limit of my 9060XT which runs image generation from 250w down to 175w and it is actually faster with the lower power limit likely due to less throttling.

2 Likes

i'm having this issue, both with home control and assist, both ollama and "openai local llm", where it tries to "turn on" automations to trigger their actions, and it doesn't work. did anyone solve this or do i just have to move everything to scripts :\

I haven't tried that, I would wonder if it is actually supported, maybe something just not working correctly in Home Assistant Core.

Worst case you can just make a script that runs the automation, and then pass the script. Not ideal but easier than migrating all the logic.

I’d have to see a log of the conversation to understand what it thinks it’s supposed to do to even pretend to answer.