My Journey to a reliable and enjoyable locally hosted voice assistant

Also, is the model size you listed for the full model or for a quantized/compressed version?


I specified the quants in the post above


Thanks, I didn't see this.

I’ve set up my pipeline according to the opening post. I’m using gpt-oss-20b-MXFP4 and everything seems to work great (and fast on my 3090), except the response format.

I can’t seem to figure out how to get rid of the harmony part. Anyone have a hint for me?

What model provider are you using and what is your config?

Model is GGML GPT-OSS:20B MXFP4, as suggested in OP.

I’m using local.ai as my backend (with llama.cpp as the actual backend). The only thing I configure there is the context size. I added the completion template there, and in local.ai’s chat that got rid of the harmony part, but in HA I haven’t managed that yet.

Config in HA is the system prompt from the OP, for now only with the addition of answering in Dutch/Flemish.
I tried adding the same chat template in HA in the chat template arguments, but that didn’t have any effect.


I don’t know how you set things in local.ai. You likely want to set --jinja so it uses that format. You might be able to set that in the chat arguments too; I’d need to check the API format for that.

local.ai’s config for the model looks like this:

backend: llama-cpp
description: Imported from huggingface://ggml-org/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf
function:
    grammar:
        disable: true
known_usecases:
    - chat
name: gpt-oss-20b-mxfp4.gguf
options:
    - use_jinja:true
parameters:
    model: gpt-oss-20b-mxfp4.gguf
template:
    use_tokenizer_template: true
    completion: "<|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant"
context_size: 98304

Not sure if that option is taken literally or not; I have always just seen it as --jinja: llama.cpp/tools/server/README.md at 4d828bd1ab52773ba9570cc008cf209eb4a8b2f5 · ggml-org/llama.cpp · GitHub
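
For reference, when running llama.cpp directly (outside local.ai) the flag just goes on the llama-server command line; a minimal sketch, where the model path and port are examples to adjust for your setup:

```shell
# Minimal llama-server launch with the Jinja chat template enabled,
# so gpt-oss responses come back without raw harmony tokens
llama-server \
  -m /models/gpt-oss-20b-mxfp4.gguf \
  --jinja \
  --host 0.0.0.0 \
  --port 8080
```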

You might also want to double-check the chat template against what the model uses; I pull my model from Hugging Face directly in llama.cpp.


@cryznik - which OpenAI integration have you settled on? I'm currently using local ai but wondering whether to try out one of the forks of Extended OpenAI Conversation…

I’ve also gotten Qwen 3.5 up and running (the Q3_K_XL Unsloth variant) - early days, but it seems to be working reasonably well so far, and the extra VRAM available for a larger context + Parakeet / Kokoro is helpful. I tried Qwen3-ASR as well, but despite the headline latency benchmarks I was surprisingly underwhelmed!

Here is my initial config for llama.cpp (running on an rtx 4090):

docker run --name Pandora-Brain \
  --gpus '"device=0"' \
  -p 9900:9900 \
  -v ./llama.cpp/models:/models \
  -v ./llama.cpp/models/templates:/templates \
  local/llama.cpp:server-cuda \
  -m /models/unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf \
  --alias "Pandora" \
  -c 16384 \
  -n 1024 \
  -b 1024 \
  -e \
  -ngl 99 \
  --chat-template-kwargs '{"enable_thinking":false}' \
  --jinja \
  --mmproj /models/unsloth/Qwen3.5-35B-A3B-GGUF/mmproj-BF16.gguf \
  --chat-template-file /templates/Qwen35.jinja \
  --parallel 1 \
  --port 9900 \
  --host 0.0.0.0 \
  --flash-attn on \
  --top-k 20 \
  --top-p 0.8 \
  --temp 0.7 \
  --min-p 0 \
  --presence-penalty 1.5 \
  --repeat-penalty 1.0 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --rope-scaling linear
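
Once the container is up, a quick way to sanity-check the OpenAI-compatible endpoint before wiring it into HA (sketch; host, port, and alias match the docker command above, adjust to your setup):

```shell
# Hit llama-server's OpenAI-compatible chat endpoint with a one-off request
curl http://localhost:9900/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Pandora",
    "messages": [{"role": "user", "content": "Say hi in one word."}]
  }'
```

If that returns a normal JSON completion, any remaining formatting issues are on the integration side rather than the server side.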

I use skye-harris/hass_local_openai_llm on GitHub (a Home Assistant LLM integration for local OpenAI-compatible services such as llama.cpp and vLLM), as it stays true to the way Assist works while providing some helpful features such as date/time injection.


Thanks - I'm using the same one, then. I was just slightly uncertain whether it might be overriding some of the arguments from my docker configuration above (e.g. temperature appears as an option in the integration config).

The only one it overrides is temp, generally because you may want a different temperature for this use case than for others.