My Journey to a Reliable and Enjoyable Locally Hosted Voice Assistant

Optimizing Performance with llama.cpp

I started out using Ollama. It is nice because it is easier to get started and most config items (like context size) are handled through the API; however, I quickly hit limitations.

llama.cpp has the following advantages that make it easy to improve performance:

  • direct control over how the context is split into slots
  • easy benchmarking of batch and ubatch sizes to find the optimal settings for performance

Below are my Docker Compose file and llama.cpp config file, which show the general optimizations I make.

docker-compose.yml
services:
  llama.cpp:
    container_name: llama.cpp
    image: ghcr.io/ggml-org/llama.cpp:server-vulkan
    devices:
      - /dev/dri:/dev/dri
      - /dev/kfd:/dev/kfd
    volumes:
      - ./models:/root/.cache/llama.cpp/
    ports:
      - 11434:8080
    restart: always
    environment:
      GGML_VK_VISIBLE_DEVICES: 1
    command:
      - "--models-preset"
      - "/root/.cache/llama.cpp/config.ini"
config.ini
[GPT-OSS]

; Model Name
hf = unsloth/gpt-oss-20b-GGUF:UD-Q8_K_XL

; Set Reasoning Effort
chat-template-kwargs = {"reasoning_effort": "low"}

; Unsloth Tuning
temp = 1.0
top-k = 0 
top-p = 0 

; GPU Performance
n-gpu-layers = -1
flash-attn = on

; Quantize Cache
cache-type-k = q8_0
cache-type-v = q8_0

; Increased Batch Size
threads = 4 
batch-size = 2048
ubatch-size = 2048

; Context shared across slots
ctx-size = 120000
parallel = 5 
slot-prompt-similarity = 0.2

The prompt problem

The user-editable part of the prompt will always be cached, but the prompt is adjusted depending on the device being used for voice (switching between a phone and a satellite, interacting with a different room, etc.). This invalidates the cache at that point in the prompt, requiring all subsequent tokens to be reprocessed. To illustrate how that affects things, see the logs from llama.cpp below:

processing a new prompt section:

prompt eval time =    1534.17 ms /  4744 tokens (    0.32 ms per token,  3092.23 tokens per second)
       eval time =     300.64 ms /    38 tokens (    7.91 ms per token,   126.40 tokens per second)
       total time =    1834.81 ms /  4782 tokens

the same request with the prompt fully cached:

prompt eval time =      31.29 ms /     1 tokens (   31.29 ms per token,    31.96 tokens per second)
       eval time =     276.43 ms /    38 tokens (    7.27 ms per token,   137.47 tokens per second)
       total time =     307.72 ms /    39 tokens

As you can see, the cached request is handled significantly faster because it does not need to reprocess a large portion of the prompt.
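Doing the arithmetic on the two totals above, the warm-cache request completes roughly six times faster end to end:

```python
# Total times from the two log excerpts above (milliseconds).
cold_total_ms = 1834.81  # new prompt section: 4744 prompt tokens reprocessed
warm_total_ms = 307.72   # fully cached prompt: only 1 new token to evaluate

speedup = cold_total_ms / warm_total_ms
print(f"{speedup:.1f}x faster with a warm prompt cache")  # → 6.0x faster with a warm prompt cache
```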

To solve this, we set the following config:

ctx-size = 120000
parallel = 5 
slot-prompt-similarity = 0.2

This sets the context size to 120k tokens, split into 5 slots of equal size (24k tokens each). The slot-prompt-similarity setting ensures that the user-defined portion of the prompt does not cause a single slot to be reused across multiple satellites. For me this works perfectly with my 4 satellites (plus an extra slot for chatting elsewhere). Once the slots are loaded with each device’s prompt, subsequent requests are processed very quickly (~1 second).
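The slot arithmetic can be sketched as follows (values taken from the config above; the comment on slot selection reflects my understanding of the slot-prompt-similarity option, not its exact implementation):

```python
# Per-slot context: the total context is divided evenly across the slots.
ctx_size = 120_000   # ctx-size from config.ini
parallel = 5         # number of slots

per_slot = ctx_size // parallel
print(f"{per_slot} tokens per slot")  # → 24000 tokens per slot

# With slot-prompt-similarity = 0.2, an incoming request only reuses a
# slot's cache when its prompt is sufficiently similar to that slot's
# cached prompt; otherwise another slot is chosen, so each satellite's
# distinct prompt tends to stay pinned to its own slot.
```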

Optimizing Batch Performance

llama.cpp provides a benchmarking tool which makes it easy to compare performance with various parameters. Here is an example command for my current hardware:

MODEL=/models/unsloth_gpt-oss-20b-GGUF_gpt-oss-20b-UD-Q8_K_XL.gguf &&
docker run -it --rm \
    --device /dev/dri:/dev/dri \
    --device /dev/kfd:/dev/kfd \
    --shm-size 16G \
    -e GGML_VK_VISIBLE_DEVICES=1 \
    -v "$(pwd)/models:/models" \
    --entrypoint /app/llama-bench \
    ghcr.io/ggml-org/llama.cpp:full-vulkan \
    -m "$MODEL" \
    -b 1024,2048,4096 \
    -ub 512,1024,2048 \
    -ngl 999 \
    -fa 1 \
    -pg 4096,256 -p 0 -n 0

This will produce output like the below, which shows that batch 2048 / ubatch 2048 gives the optimal balance of performance and memory usage:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /app/libggml-vulkan.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B Q8_0               |  12.28 GiB |    20.91 B | Vulkan     | 999 |    1024 |      512 |  1 |    pp4096+tg256 |       1500.17 ± 4.07 |
| gpt-oss 20B Q8_0               |  12.28 GiB |    20.91 B | Vulkan     | 999 |    1024 |     1024 |  1 |    pp4096+tg256 |       1604.33 ± 8.50 |
| gpt-oss 20B Q8_0               |  12.28 GiB |    20.91 B | Vulkan     | 999 |    1024 |     2048 |  1 |    pp4096+tg256 |       1604.06 ± 7.07 |
| gpt-oss 20B Q8_0               |  12.28 GiB |    20.91 B | Vulkan     | 999 |    2048 |      512 |  1 |    pp4096+tg256 |       1492.21 ± 1.40 |
| gpt-oss 20B Q8_0               |  12.28 GiB |    20.91 B | Vulkan     | 999 |    2048 |     1024 |  1 |    pp4096+tg256 |       1603.49 ± 9.15 |
| gpt-oss 20B Q8_0               |  12.28 GiB |    20.91 B | Vulkan     | 999 |    2048 |     2048 |  1 |    pp4096+tg256 |       1647.49 ± 0.76 |
| gpt-oss 20B Q8_0               |  12.28 GiB |    20.91 B | Vulkan     | 999 |    4096 |      512 |  1 |    pp4096+tg256 |       1470.10 ± 2.68 |
| gpt-oss 20B Q8_0               |  12.28 GiB |    20.91 B | Vulkan     | 999 |    4096 |     1024 |  1 |    pp4096+tg256 |       1584.26 ± 1.97 |
| gpt-oss 20B Q8_0               |  12.28 GiB |    20.91 B | Vulkan     | 999 |    4096 |     2048 |  1 |    pp4096+tg256 |       1648.34 ± 2.38 |

build: e2f19b320 (8087)