My Journey to a Reliable and Enjoyable Locally Hosted Voice Assistant

If you are not already, I highly recommend quantizing your KV cache. It considerably reduces memory usage without any noticeable loss in accuracy.
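For a rough sense of the savings, here is a back-of-the-envelope KV-cache size estimate. The model dimensions below are illustrative assumptions (a generic GQA model), not the exact dimensions of any specific model; the q8_0 figure assumes llama.cpp's block layout of roughly 34 bytes per 32 elements.

```python
def kv_cache_bytes(ctx_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Approximate KV cache size: one K and one V tensor per layer."""
    return 2 * ctx_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Illustrative GQA dimensions (assumptions, not a specific model's config).
ctx, layers, kv_heads, dim = 72000, 48, 8, 128

f16 = kv_cache_bytes(ctx, layers, kv_heads, dim, 2.0)      # f16: 2 bytes/elem
q8 = kv_cache_bytes(ctx, layers, kv_heads, dim, 34 / 32)   # q8_0: ~1.06 bytes/elem

print(f"f16:  {f16 / 2**30:.1f} GiB")   # ≈ 13.2 GiB
print(f"q8_0: {q8 / 2**30:.1f} GiB")    # ≈ 7.0 GiB
```

Under these assumptions, q8_0 roughly halves the cache footprint, which is where the extra room for context slots comes from.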

That’s why I prefer to run Qwen3-VL 30B-A3B: it leaves me more room for multiple context slots with large sizes.

      # Quantize Cache
      - "--cache-type-k"
      - "q8_0"
      - "--cache-type-v"
      - "q8_0"

      # Performance
      - "--flash-attn"
      - "on"

      # Support Qwen3 Template
      - "--jinja"

      # Increased batch size for prompt ingestion performance
      - "--threads"
      - "12"
      - "--batch-size"
      - "2048"
      - "--ubatch-size"
      - "1024"

      # Token context shared across slots
      - "--ctx-size"
      - "72000"
      - "--parallel"
      - "4"
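One thing worth noting about the flags above: with the standard (non-unified) KV cache, `--ctx-size` is the total shared across the `--parallel` slots, so each slot's effective window is the quotient:

```python
ctx_size, parallel = 72000, 4   # values from the config above
per_slot = ctx_size // parallel
print(per_slot)  # → 18000 tokens available per slot
```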

Nick, how much VRAM is being used by that 72k context? That’s dangerously close to being capable of running Friday’s entire live context (currently 96k, though I think 64k is doable).

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   51C    P5             24W /  280W |   22744MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           59249      C   /app/llama-server                     22734MiB |
+-----------------------------------------------------------------------------------------+
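To answer the VRAM question directly, the nvidia-smi output above gives the headroom:

```python
total_mib, used_mib = 24576, 22744   # figures from the nvidia-smi output above
free_mib = total_mib - used_mib
print(f"{free_mib} MiB free ({free_mib / 1024:.1f} GiB)")  # 1832 MiB (~1.8 GiB)
```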

Seems like it would probably be doable.


That’s currently #5 on my list, behind posting the current script updates and completing the deploy script.

If I can get a sub-3-second response off a >64K context, we’re there. I’ve been eyeballing that Qwen multimodal model hard. That + Q3VL + OSS20b is a serious combo (and suddenly VRAM availability is a problem).


I got Qwen3-ASR 0.6B running on my 9060XT, which is being installed as a secondary system to help out with voice specifically. It’s running via vLLM, and this is the result. It will need more testing to see how accurate it is, but this is promising.


I actually have my Frigate model set up with the k & v cache types quantized. I don’t, however, have my HA model set up with those flags: they introduce a slight overhead that results in ever-so-slightly longer wait times (tenths of a second). That isn’t bad, but my GPU is only at 88% VRAM usage with the 32B and 8B models loaded and in use, even simultaneously, so I prefer the extra speed for HA.

I tried the A3B 30B model and it’s just too slow for HA usage (for me on my card anyway). That many parameters just slows it down too much.

In that screenshot, I think it shows the request wasn’t processed by the LLM but rather handled by the HA device directly (locally, versus being passed off to an LLM).

That said, I’m sure a 0.6B model would run very quickly, it’s just a matter of it understanding things.

My personal experience has been that 8B is just about perfect; 4B is functional but prone to mistakes (occasionally saying it completed things it didn’t, telling me which lights are on when I ask but including ones that aren’t, etc.); and 2B would only work with very, very specific commands. Things like “what should I wear if I head out tomorrow?” or “Good morning, I’m getting out of bed” would result in just wacky replies.

I’ve never tried a 0.6B model, and I only just began testing different models the other day, for the first time since I got everything set up mid last year, so maybe there have been advances that I haven’t gotten to experience (I continue to test lol).

I think you misunderstand, Qwen3-ASR is for speech recognition (speech to text), not for the LLM itself.

Also, regarding models and sizes, I’m not sure if it was already known, but there is a leaderboard for LLMs in Home Assistant with some benchmarks.

My experience matches this, which is why I love 30B-A3B. 8B was great, but it seemed to have less “depth”; 30B-A3B is more capable of seeing past transcription errors or speech mistakes and knowing what we mean.

Of course, the best model is different for everyone, which is part of what makes this fun to tinker with.


Yup, 100% missed that lol

You never fail to share new things I haven’t come across, my friend, hahaha. That leaderboard is awesome, and no, it wasn’t known to me… so cool.

As for your sentiment of “best model…” and “fun to tinker” - couldn’t have said it better myself.

Playing around with this stuff makes me feel like a kid watching Sci-Fi movies and the joy/wonder of what the future might be like…and now I get to live it…it’s magical and almost surreal.


Also, an update on Qwen3-ASR: for short requests like “turn on the lights” it is very fast, but with longer sentences like “Turn on the lights and turn off the fan” it is significantly slower (~1.5 seconds). The explanation seems to be that the architecture is slower when multiple frames of audio are passed in. Perhaps it will get better as the feature matures in vLLM. For now, Parakeet still seems to be better overall.


While waiting for Qwen3.5 to drop, I have been playing with other models. The one difficult thing about Qwen3-VL is that it does not respond well to negative prompting (e.g. “do not…”, “never…”).

After recent llama.cpp fixes I decided to give GLM4.7 Flash a try, and it is working quite well. It seems to handle negative prompting better; its main problem is an over-eagerness to “do something,” so it is difficult to have it correctly ignore text after a false activation. It would also try to run an action on just a few words: for example, saying only “dog digging” would have it run a different tool every time, trying to be helpful. It also often struggled to understand what room it was in.

For now I have gone back to Qwen3-VL again.