Yeah, the prompt processing speeds I have seen on shared-memory machines (Strix Halo, Mac M5) leave a lot to be desired. But it is definitely nice to have options depending on priorities.
Optimizing Performance with llama.cpp
I started out by using Ollama. It is nice because it is easier to get started and most config items (like context size) are handled through the API, but I quickly hit limitations.
llama.cpp has the following advantages that make it easy to improve performance:
- direct control over context being split into slots
- easy ability to benchmark batch and ubatch sizes to find optimal options for performance
Below are my docker compose and config file for llama.cpp to generally show the optimizations that I make.
docker-compose.yml
services:
  llama.cpp:
    container_name: llama.cpp
    image: ghcr.io/ggml-org/llama.cpp:server-vulkan
    devices:
      - /dev/dri:/dev/dri
      - /dev/kfd:/dev/kfd
    volumes:
      - ./models:/root/.cache/llama.cpp/
    ports:
      - 11434:8080
    restart: always
    environment:
      GGML_VK_VISIBLE_DEVICES: 1
    command:
      - "--models-preset"
      - "/root/.cache/llama.cpp/config.ini"
config.ini
[GPT-OSS]
; Model Name
hf = unsloth/gpt-oss-20b-GGUF:UD-Q8_K_XL
; Set Reasoning Effort
chat-template-kwargs = {"reasoning_effort": "low"}
; Unsloth Tuning
temp = 1.0
top-k = 0
top-p = 0
; GPU Performance
n-gpu-layers = -1
flash-attn = on
; Quantize Cache
cache-type-k = q8_0
cache-type-v = q8_0
; Increased Batch Size
threads = 4
batch-size = 2048
ubatch-size = 2048
; Context shared across slots
ctx-size = 120000
parallel = 5
slot-prompt-similarity = 0.2
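To give a rough sense of what the q8_0 KV-cache quantization buys, here is a back-of-the-envelope estimate. The layer count, KV-head count, and head dimension below are placeholder values for illustration, not gpt-oss-20b's real dimensions; the point is only that a q8_0 block stores 32 values in 34 bytes (int8 quants plus an f16 scale), roughly half the footprint of f16:

```python
# Rough KV-cache size estimate (hypothetical model dimensions, NOT gpt-oss-20b's).
n_layers, n_kv_heads, head_dim = 24, 8, 64
ctx = 120_000  # matches ctx-size in config.ini

def kv_cache_bytes(bytes_per_elem: float) -> float:
    # K and V each store ctx * n_kv_heads * head_dim elements per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

f16 = kv_cache_bytes(2.0)      # f16: 2 bytes per element
q8 = kv_cache_bytes(34 / 32)   # q8_0: 34 bytes per block of 32 elements

print(f"f16 KV cache:  {f16 / 2**30:.2f} GiB")
print(f"q8_0 KV cache: {q8 / 2**30:.2f} GiB ({q8 / f16:.0%} of f16)")
```

Whatever the real model dimensions are, the ratio holds: quantizing both cache types to q8_0 cuts the KV cache to about 53% of its f16 size, which is what makes a 120k shared context practical.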
The prompt problem
The user-editable part of the prompt will always be cached, but the prompt is adjusted depending on the device being used for voice (switching between a phone and a satellite, interacting with a different room, etc). This breaks the cache at that point in the prompt, and every token after it has to be reprocessed. To give an example of how that affects things, see the below logs from llama.cpp:
processing a new prompt section:
prompt eval time = 1534.17 ms / 4744 tokens ( 0.32 ms per token, 3092.23 tokens per second)
eval time = 300.64 ms / 38 tokens ( 7.91 ms per token, 126.40 tokens per second)
total time = 1834.81 ms / 4782 tokens
the same request with the prompt fully cached:
prompt eval time = 31.29 ms / 1 tokens ( 31.29 ms per token, 31.96 tokens per second)
eval time = 276.43 ms / 38 tokens ( 7.27 ms per token, 137.47 tokens per second)
total time = 307.72 ms / 39 tokens
As you can see, the request is handled significantly faster because a large portion of the prompt does not need to be reprocessed.
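As a quick sanity check on those numbers, pure arithmetic on the total-time figures quoted in the logs above:

```python
# Total times from the llama.cpp logs above (milliseconds).
cold_ms = 1834.81  # new prompt section: 4744 prompt tokens reprocessed
warm_ms = 307.72   # fully cached: only 1 prompt token evaluated

speedup = cold_ms / warm_ms
print(f"warm requests are ~{speedup:.1f}x faster end to end")  # ~6.0x
```

And that gap grows with prompt length: eval time is nearly identical in both logs, so almost all of the difference is prompt processing.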
To solve this we set the following config:
ctx-size = 120000
parallel = 5
slot-prompt-similarity = 0.2
This sets the context size to 120k tokens, split into 5 slots of equal size (24k tokens each). The slot-prompt-similarity threshold keeps the shared user-defined portion of the prompt from causing a single slot to be re-used across multiple satellites. For me this works perfectly with my 4 satellites (plus an extra slot for chatting elsewhere). Once the slots are loaded with each device’s prompt, subsequent requests are processed very quickly (~1 second).
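The slot sizing is simple division, but it is worth doing explicitly when picking values, since each device's full prompt (plus response headroom) has to fit inside one slot:

```python
ctx_size = 120_000
parallel = 5

per_slot = ctx_size // parallel
print(f"each of the {parallel} slots gets {per_slot} tokens of context")  # 24000

# Rule of thumb: ctx-size >= parallel * (largest device prompt + headroom).
```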
Optimizing Batch Performance
llama.cpp provides a benchmarking tool which makes it easy to compare performance with various parameters. Here is an example command for my current hardware:
MODEL=/models/unsloth_gpt-oss-20b-GGUF_gpt-oss-20b-UD-Q8_K_XL.gguf &&
docker run -it --rm \
--device /dev/dri:/dev/dri \
--device /dev/kfd:/dev/kfd \
--shm-size 16G \
-e GGML_VK_VISIBLE_DEVICES=1 \
-v "$(pwd)/models:/models" \
--entrypoint /app/llama-bench \
ghcr.io/ggml-org/llama.cpp:full-vulkan \
-m "$MODEL" \
-b 1024,2048,4096 \
-ub 512,1024,2048 \
-ngl 999 \
-fa 1 \
-pg 4096,256 -p 0 -n 0
This will produce output like the table below, which shows that batch 2048 / ubatch 2048 gives the optimal balance of performance and memory usage:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /app/libggml-vulkan.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 1024 | 512 | 1 | pp4096+tg256 | 1500.17 ± 4.07 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 1024 | 1024 | 1 | pp4096+tg256 | 1604.33 ± 8.50 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 1024 | 2048 | 1 | pp4096+tg256 | 1604.06 ± 7.07 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 2048 | 512 | 1 | pp4096+tg256 | 1492.21 ± 1.40 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 2048 | 1024 | 1 | pp4096+tg256 | 1603.49 ± 9.15 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 2048 | 2048 | 1 | pp4096+tg256 | 1647.49 ± 0.76 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 4096 | 512 | 1 | pp4096+tg256 | 1470.10 ± 2.68 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 4096 | 1024 | 1 | pp4096+tg256 | 1584.26 ± 1.97 |
| gpt-oss 20B Q8_0 | 12.28 GiB | 20.91 B | Vulkan | 999 | 4096 | 2048 | 1 | pp4096+tg256 | 1648.34 ± 2.38 |
build: e2f19b320 (8087)
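Reading the table programmatically makes the trade-off explicit: the top two configurations are within each other's error bars, so the smaller batch size wins on memory. A quick sketch using the t/s numbers copied from the llama-bench table above:

```python
# (n_batch, n_ubatch, t/s) rows copied from the llama-bench table above.
rows = [
    (1024,  512, 1500.17), (1024, 1024, 1604.33), (1024, 2048, 1604.06),
    (2048,  512, 1492.21), (2048, 1024, 1603.49), (2048, 2048, 1647.49),
    (4096,  512, 1470.10), (4096, 1024, 1584.26), (4096, 2048, 1648.34),
]

rows.sort(key=lambda r: r[2], reverse=True)  # fastest first
best, runner_up = rows[0], rows[1]
print(best)       # (4096, 2048, 1648.34)
print(runner_up)  # (2048, 2048, 1647.49)

# The gap (< 1 t/s) is inside the reported ± noise, so b=2048 / ub=2048
# delivers the same throughput with smaller batch buffers.
```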
80B on an M3 Ultra has a response time of 2-3 seconds cold, and as short as 1 second when warm for a simple request. And I have a lot of context. The upcoming M5 Max should be in that range. But perhaps the real advantage of 128GB+ today is not a smarter model, but the ability to have 80B loaded along with 30B VL.
A Studio M4 Max is probably a second or two slower than my Ultra. But that is on par with using Gemini Flash in my experience. And $2500-$3500 looks quite reasonable today when you look at the price of Nvidia cards. Although paying thousands to avoid paying Google a few dollars a month is not exactly a decision based on economics.
The smaller qwen3.5 models have arrived!
Giving 35b MoE and 27b dense models a try.
I am trying it now; there is, however, a bug in llama.cpp currently which means the prompt is not saved correctly, so it is reprocessed every time. They have a fix up in a PR that works.
However, so far in testing, Qwen3.5 MoE at Q4_K_XL does not seem to be doing well for me. It is making a lot of the same errors that Qwen3-VL 30B Instruct made, errors that GPT-OSS does not make.
Hoping to see if a little prompt tuning helps or if it just isn’t as good at instruction following.
Man… after a full day of using the 35B MoE model, it’s frustratingly bad at adhering to my prompt (which is doubly frustrating because I fixed my llama.cpp setup and am getting the appropriate speeds out of MoE models now). This isn’t related to my HA commands, however (I’m just so happy with that setup with the model I’m using), but rather Frigate. I’ll post my results elsewhere since they’re only tangentially related to this thread; my tests were for a separate purpose and I don’t want to sidetrack the conversation.
If you haven’t already, for HA I would suggest trying gpt-oss:20b. It is very fast with its native mxfp4 and it adheres to prompts very well.
Yeah it’s crazy good… it also writes flawless json.
Well I’m convinced, another something to play with over the weekend…I’d act exasperated, but, anybody in this thread knows I’m looking forward to it haha.
Seeing what you can get a new model to do is… Pretty damn cool… Tbh.
I’ve had pretty good luck with Qwen3-VL no think. Will be trying out 3.5 soon. I really like having a vision model for frigate and other things (paperlessgpt, etc)
Yeah it definitely helps now that I have two GPUs so I can run a vision model separately.
I have actually re-tried Qwen3.5 35B-A3B because I read about an issue with the unsloth Q4_K_XL quants not being made properly. I switched to the MXFP4_MOE quant and that had considerably better results; still not as good as GPT-OSS, but probably good enough that I could fix it with prompt tweaks.
It is worth pointing out that for Frigate, Qwen3.5 is worse because it does not have DeepStack in its vision projector, so it is less capable of seeing smaller objects (in people’s hands, for example).
What model are you using for frigate?
I use unsloth Qwen3-VL:8B Q8_K_XL
@crzynik have you tested out any of the new Qwen 3.5 models? Would love to hear how they stack up to Qwen3 VL. Edit: nm, just didn’t scroll up! lol
It also looks like you’ve moved to gpt-oss from qwen3. I might have to try that out now while 3.5 simmers.
Final edit: I have 2 3060s in my machine and I typically expose my GPUs via deploy in the docker compose file. When I try to use the Vulkan image and pass through 0,1 or 1,2 along with /dev/dri:/dev/dri, the image says it’s not finding my cards.
I’ve reverted back to CUDA, but would love some help on converting this to Vulkan, as I presume it’s better than the CUDA image???
llama.cpp:
  container_name: llamacpp
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["0", "1"]
            capabilities:
              - gpu
  # devices:
  #   - /dev/dri:/dev/dri
  #   - /dev/kfd:/dev/kfd
  volumes:
    - ${ROOT}/config/lamma.cpp:/root/.cache/llama.cpp
  image: ghcr.io/ggml-org/llama.cpp:server-cuda #ghcr.io/ggml-org/llama.cpp:server-vulkan #
  # environment:
  #   GGML_VK_VISIBLE_DEVICES: 0,1
  command:
    - "--models-preset"
    - "/root/.cache/llama.cpp/config.ini"
You didn’t specify which llama.cpp model, but regardless, the response times you posted are heartbreaking for me lol. For HA I’d use a dumber model; try something like a 4B model, there are plenty of newer models made to be small and smart.
Not sure what you mean by “which llama.cpp model”; I have included information about all of the models (including the exact variant and quant) in my post. Do you mean which llama.cpp build? I currently use Vulkan with my 7900 XTX.
Once the prompts are cached, which happens just once after starting llama.cpp and using a speaker, the response time is at most 2 seconds, usually 1 second. I don’t see how that could be much better, and a dumber model that makes more mistakes would waste much more time than that.
I think CUDA is probably best for Nvidia GPUs
Oh, for some reason I thought you had a 3090 that you were using for your stuff and presumed Vulkan (which is supported on 30-series cards) was better than CUDA. I’ll stick with CUDA for now, thanks. I have it working with gpt-oss and it seems to be working alright. Does gpt-oss also have VL similar to the Qwen model you were using earlier? I’ve connected my llama.cpp to opencode and am poking around here as well. Really looking for the best all-around model for my 2 3060s that can do all the things for HA and Frigate as well as coding tasks. I’m looking for the unicorn model. lol
I did have a 3090, but it was bought used and it seems it was used for mining; it had stability problems, so I moved on.
There is no vision support in the current gpt-oss