Yeah, the prompt processing speeds I have seen on shared-memory machines (Strix Halo, Mac M5) leave a lot to be desired. But it is definitely nice to have options depending on priorities.
Optimizing Performance with llama.cpp
I started out by using Ollama. It is nice because it is easier to get started and most config items (like context size) are handled through the API, but I quickly hit limitations.
llama.cpp has the following advantages that make it easy to improve performance:
- direct control over context being split into slots
- easy ability to benchmark batch and ubatch sizes to find optimal options for performance
Below are my docker compose and config file for llama.cpp to generally show the optimizations that I make.
docker-compose.yml
services:
  llama.cpp:
    container_name: llama.cpp
    image: ghcr.io/ggml-org/llama.cpp:server-vulkan
    devices:
      - /dev/dri:/dev/dri
      - /dev/kfd:/dev/kfd
    volumes:
      - ./models:/root/.cache/llama.cpp/
    ports:
      - 11434:8080
    restart: always
    environment:
      GGML_VK_VISIBLE_DEVICES: 1
    command:
      - "--models-preset"
      - "/root/.cache/llama.cpp/config.ini"
config.ini
[GPT-OSS]
; Model Name
hf = unsloth/gpt-oss-20b-GGUF:UD-Q8_K_XL
; Set Reasoning Effort
chat-template-kwargs = {"reasoning_effort": "low"}
; Unsloth Tuning
temp = 1.0
top-k = 0
top-p = 0
; GPU Performance
n-gpu-layers = -1
flash-attn = on
; Quantize Cache
cache-type-k = q8_0
cache-type-v = q8_0
; Increased Batch Size
threads = 4
batch-size = 2048
ubatch-size = 2048
; Context shared across slots
ctx-size = 120000
parallel = 5
slot-prompt-similarity = 0.2
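To give a rough sense of what the q8_0 KV-cache quantization buys, here is a back-of-the-envelope estimate. The layer count, KV-head count, and head dimension below are placeholder values for illustration, not gpt-oss-20b's real dimensions; the point is only that a q8_0 block stores 32 values in 34 bytes (int8 quants plus an f16 scale), roughly half the footprint of f16:

```python
# Rough KV-cache size estimate (hypothetical model dimensions, NOT gpt-oss-20b's).
n_layers, n_kv_heads, head_dim = 24, 8, 64
ctx = 120_000  # matches ctx-size in config.ini

def kv_cache_bytes(bytes_per_elem: float) -> float:
    # K and V each store ctx * n_kv_heads * head_dim elements per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

f16 = kv_cache_bytes(2.0)      # f16: 2 bytes per element
q8 = kv_cache_bytes(34 / 32)   # q8_0: 34 bytes per block of 32 elements

print(f"f16 KV cache:  {f16 / 2**30:.2f} GiB")
print(f"q8_0 KV cache: {q8 / 2**30:.2f} GiB ({q8 / f16:.0%} of f16)")
```

Whatever the real model dimensions are, the ratio holds: quantizing both cache types to q8_0 cuts the KV cache to about 53% of its f16 size, which is what makes a 120k shared context practical.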
The prompt problem
The user-editable part of the prompt will always be cached, but the prompt is adjusted depending on the device being used for voice (switching between a phone and a satellite, interacting with a different room, etc). This breaks the cache at that point in the prompt, and every token after it has to be reprocessed. To give an example of how that affects things, see the below logs from llama.cpp:
processing a new prompt section:
prompt eval time = 1534.17 ms / 4744 tokens ( 0.32 ms per token, 3092.23 tokens per second)
eval time = 300.64 ms / 38 tokens ( 7.91 ms per token, 126.40 tokens per second)
total time = 1834.81 ms / 4782 tokens
the same request with the prompt fully cached:
prompt eval time = 31.29 ms / 1 tokens ( 31.29 ms per token, 31.96 tokens per second)
eval time = 276.43 ms / 38 tokens ( 7.27 ms per token, 137.47 tokens per second)
total time = 307.72 ms / 39 tokens
As you can see, the request is handled significantly faster because a large portion of the prompt does not need to be reprocessed.
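As a quick sanity check on those numbers, pure arithmetic on the total-time figures quoted in the logs above:

```python
# Total times from the llama.cpp logs above (milliseconds).
cold_ms = 1834.81  # new prompt section: 4744 prompt tokens reprocessed
warm_ms = 307.72   # fully cached: only 1 prompt token evaluated

speedup = cold_ms / warm_ms
print(f"warm requests are ~{speedup:.1f}x faster end to end")  # ~6.0x
```

And that gap grows with prompt length: eval time is nearly identical in both logs, so almost all of the difference is prompt processing.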
To solve this we set the following config:
ctx-size = 120000
parallel = 5
slot-prompt-similarity = 0.2
This sets the context size to 120k tokens, split into 5 slots of equal size (24k tokens each). The slot-prompt-similarity threshold keeps the shared user-defined portion of the prompt from causing a single slot to be re-used across multiple satellites. For me this works perfectly with my 4 satellites (plus an extra slot for chatting elsewhere). Once the slots are loaded with each device’s prompt, subsequent requests are processed very quickly (~1 second).
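The slot sizing is simple division, but it is worth doing explicitly when picking values, since each device's full prompt (plus response headroom) has to fit inside one slot:

```python
ctx_size = 120_000
parallel = 5

per_slot = ctx_size // parallel
print(f"each of the {parallel} slots gets {per_slot} tokens of context")  # 24000

# Rule of thumb: ctx-size >= parallel * (largest device prompt + headroom).
```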
Optimizing Batch Performance
llama.cpp provides a benchmarking tool which makes it easy to compare performance with various parameters. Here is an example command for my current hardware:
MODEL=/models/unsloth_gpt-oss-20b-GGUF_gpt-oss-20b-UD-Q8_K_XL.gguf &&
docker run -it --rm \
--device /dev/dri:/dev/dri \
--device /dev/kfd:/dev/kfd \
--shm-size 16G \
-e GGML_VK_VISIBLE_DEVICES=1 \
-v "$(pwd)/models:/models" \
--entrypoint /app/llama-bench \
ghcr.io/ggml-org/llama.cpp:full-vulkan \
-m "$MODEL" \
-b 1024,2048,4096 \
-ub 512,1024,2048 \
-ngl 999 \
-fa 1 \
-pg 4096,256 -p 0 -n 0
This will produce output like the table below, which shows that batch 2048 / ubatch 2048 gives the optimal balance of performance and memory usage:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /app/libggml-vulkan.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 1024 | 512 | 1 | pp4096+tg256 | 1500.17 ± 4.07 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 1024 | 1024 | 1 | pp4096+tg256 | 1604.33 ± 8.50 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 1024 | 2048 | 1 | pp4096+tg256 | 1604.06 ± 7.07 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 2048 | 512 | 1 | pp4096+tg256 | 1492.21 ± 1.40 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 2048 | 1024 | 1 | pp4096+tg256 | 1603.49 ± 9.15 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 2048 | 2048 | 1 | pp4096+tg256 | 1647.49 ± 0.76 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 4096 | 512 | 1 | pp4096+tg256 | 1470.10 ± 2.68 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 4096 | 1024 | 1 | pp4096+tg256 | 1584.26 ± 1.97 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 4096 | 2048 | 1 | pp4096+tg256 | 1648.34 ± 2.38 |
build: e2f19b320 (8087)
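Reading the table programmatically makes the trade-off explicit: the top two configurations are within each other's error bars, so the smaller batch size wins on memory. A quick sketch using the t/s numbers copied from the llama-bench table above:

```python
# (n_batch, n_ubatch, t/s) rows copied from the llama-bench table above.
rows = [
    (1024,  512, 1500.17), (1024, 1024, 1604.33), (1024, 2048, 1604.06),
    (2048,  512, 1492.21), (2048, 1024, 1603.49), (2048, 2048, 1647.49),
    (4096,  512, 1470.10), (4096, 1024, 1584.26), (4096, 2048, 1648.34),
]

rows.sort(key=lambda r: r[2], reverse=True)  # fastest first
best, runner_up = rows[0], rows[1]
print(best)       # (4096, 2048, 1648.34)
print(runner_up)  # (2048, 2048, 1647.49)

# The gap (< 1 t/s) is inside the reported ± noise, so b=2048 / ub=2048
# delivers the same throughput with smaller batch buffers.
```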
80B on an M3 Ultra has a response time of 2-3 seconds cold, and as short as 1 second when warm for a simple request. And I have a lot of context. The upcoming M5 Max should be in that range. But perhaps the real advantage of 128GB+ today is not a smarter model, but the ability to have 80B loaded along with 30B VL.
A Studio M4 Max is probably a second or two slower than my Ultra. But that is on par with using Gemini Flash in my experience. And $2500-$3500 looks quite reasonable today when you look at the price of Nvidia cards. Although paying thousands to avoid paying Google a few dollars a month is not exactly a decision based on economics.
The smaller qwen3.5 models have arrived!
Giving 35b MoE and 27b dense models a try.
I am trying it now; there is, however, a bug in llama.cpp currently which means the prompt is not saved correctly, so it is reprocessed every time. They have a fix up in a PR that works.
However, so far in testing, Qwen3.5 MoE at Q4_K_XL does not seem to be doing well for me. It is making a lot of the same errors that Qwen3-VL 30B Instruct made, errors that GPT-OSS does not make.
Hoping to see if a little prompt tuning helps or if it just isn’t as good at instruction following.
Man… after a full day of using the 35B MoE model, it’s frustratingly bad at adhering to my prompt (which is doubly frustrating because I fixed my llama.cpp setup and am getting the appropriate speeds out of MoE models now). This isn’t related to my HA commands, however (I’m just so happy with that setup with the model I’m using), but rather Frigate. I’ll post my results elsewhere since they’re only tangentially related to this thread; my tests were for a separate purpose and I don’t want to sidetrack the conversation.
If you haven’t already, for HA I would suggest trying gpt-oss:20b. It is very fast with its native mxfp4 and it adheres to prompts very well.
Yeah it’s crazy good… it also writes flawless json.
Well I’m convinced, another something to play with over the weekend…I’d act exasperated, but, anybody in this thread knows I’m looking forward to it haha.
Seeing what you can get a new model to do is… Pretty damn cool… Tbh.
I’ve had pretty good luck with Qwen3-VL no think. Will be trying out 3.5 soon. I really like having a vision model for frigate and other things (paperlessgpt, etc)
Yeah it definitely helps now that I have two GPUs so I can run a vision model separately.
I have actually re-tried Qwen3.5 35B-A3B because I read about an issue with the unsloth Q4_K_XL quants not being made properly. I switched to the MXFP4_MOE quant and that had considerably better results; still not as good as GPT-OSS, but probably good enough that I could fix it with prompt tweaks.
It is worth pointing out that for Frigate, Qwen3.5 is worse because it does not have DeepStack in its vision projector, so it is less capable of seeing smaller objects (in people’s hands, for example).
What model are you using for frigate?
I use unsloth Qwen3-VL:8B Q8_K_XL
@crzynik have you tested out any of the new Qwen 3.5 models? Would love to hear how they stack up to Qwen3 VL. Edit: nm, just didn’t scroll up! lol
It also looks like you’ve moved to gpt-oss from qwen3. I might have to try that out now while 3.5 simmers.
Final edit: I have 2 3060s in my machine and I typically expose my GPUs via deploy in the docker compose file. When I try to use the Vulkan image and pass through 0,1 or 1,2 along with /dev/dri:/dev/dri, the image says it’s not finding my cards.
I’ve reverted back to CUDA, but would love some help on converting this to Vulkan, as I presume it’s better than the CUDA image???
llama.cpp:
  container_name: llamacpp
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["0", "1"]
            capabilities:
              - gpu
  # devices:
  #   - /dev/dri:/dev/dri
  #   - /dev/kfd:/dev/kfd
  volumes:
    - ${ROOT}/config/lamma.cpp:/root/.cache/llama.cpp
  image: ghcr.io/ggml-org/llama.cpp:server-cuda #ghcr.io/ggml-org/llama.cpp:server-vulkan #
  # environment:
  #   GGML_VK_VISIBLE_DEVICES: 0,1
  command:
    - "--models-preset"
    - "/root/.cache/llama.cpp/config.ini"
You didn’t specify which llama.cpp model, but regardless, the response times you posted are heartbreaking for me lol. For HA I’d use a dumber model; try something like a 4B model, there are plenty of newer models made to be small and smart.
Not sure what you mean by “which llama.cpp model”; I have included information about all of the models (including the exact variant and quant) in my post. Do you mean which llama.cpp build? I currently use Vulkan with my 7900 XTX.
Once the prompts are cached, which happens just once after starting llama.cpp and using a speaker, the response time is at most 2 seconds, usually 1 second. I don’t see how that could be much better, and a dumber model that makes more mistakes would waste much more time than that.
I think CUDA is probably best for Nvidia GPUs
Oh, for some reason I thought you had a 3090 that you were using for your stuff and presumed Vulkan (which is supported on 30-series cards) was better than CUDA. I’ll stick with CUDA for now, thanks. I have it working with gpt-oss and it seems to be working alright. Does gpt-oss also have VL similar to the Qwen model you were using earlier? I’ve connected my llama.cpp to opencode and am poking around here as well. Really looking for the best all-around model for my 2 3060s that can do all the things for HA and Frigate as well as coding tasks. I’m looking for the unicorn model. lol
I did have a 3090, but it was bought used and it seems it was used for mining; it had stability problems, so I moved on.
There is no vision support in the current gpt-oss