My Journey to a reliable and enjoyable locally hosted voice assistant

I linked your LLM integration repo somewhere on this forum a few weeks ago (as I use it myself),
and was already wondering why there's no official thread from you about this integration.

So welcome and thank you. :slightly_smiling_face:


Glad it’s working well for you :slight_smile:

I have noticed an odd issue with it since the 2026.1 update, which looks to affect the llama.cpp backend (the only one I've tested so far) when using voice input with Assist tooling enabled… I'm not sure if anyone else has encountered it yet (I know a lot of people don't update immediately), but I've identified what changed in the tooling definitions and will put out a fix release tomorrow :slight_smile:


Fantastic! I was a bit hesitant to put that into the bugfix release as well, in case there was an issue with some model or server combination I hadn't tested… but the testing I had done on it myself seemed promising enough that it was worth taking the risk.

Because the date and time are injected at the end of the conversation, it should also mean your LLM generations stay fast, as we aren't breaking the cache early in the prompt just to keep the time current! Compare the before-and-after response times of asking "what is the time" :slight_smile:
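
As a rough illustration of why that placement matters (a sketch only, using generic OpenAI-style chat messages; build_messages is a hypothetical name, not the integration's actual code): keeping volatile values like the current time out of the long static system prompt leaves the prefix identical between turns, so a backend such as llama.cpp can reuse its prompt cache instead of re-evaluating everything.

    from datetime import datetime

    def build_messages(system_prompt: str, history: list[dict], user_text: str) -> list[dict]:
        """Hypothetical helper: order messages so the static prompt stays a cacheable prefix."""
        return [
            # Static system prompt first: identical every turn, so the server can
            # reuse the prompt/KV cache it already built for this prefix.
            {"role": "system", "content": system_prompt},
            # Prior turns only ever grow at the end, which keeps the prefix stable.
            *history,
            {"role": "user", "content": user_text},
            # Volatile data (date and time) goes last, so changing it between
            # requests does not invalidate the cache for everything before it.
            {"role": "system", "content": f"Current date and time: {datetime.now():%Y-%m-%d %H:%M}"},
        ]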

Thanks :slight_smile: I shared it on Reddit when it was first released, and in a few comments there since, but I'm otherwise the type of introvert that doesn't maintain much of an online presence, so that's my excuse for why there's no official thread :slight_smile:


Kinda trivial question, but in your prompt you sometimes use # and sometimes ## for comments. Does the LLM actually care?

That is the typical hierarchy for Markdown, making it a nested header. I'm not sure what difference it makes for the LLM, but I asked Gemini and it says that a consistent hierarchy helps with its self-attention, and that the multiple levels help it understand which things are related.

Even if it doesn't help the LLM, though, it makes it much easier to see and manage the hierarchy of the prompt, which is a benefit in its own right.


LLMs don't differentiate; they read it all. As long as it's well structured, the model will figure it out. You can oftentimes use comments in code to instruct the LLM while the machine-readable side simply ignores that text.


Hi, first of all, amazing thread; I learned a lot here. I just stumbled over your use of View Assist. How do you use it currently? I've never tried it myself, but from what I learned in the documentation it relies on custom sentences and blueprints.

In my head, in an LLM setup one would probably want to create a tool for the LLM so it can decide what to show on its own, right?

It's a bit different, but also not really. It handles timers with blueprints, but besides that I don't use any of those. I made it automatically show the LLM response when you ask a question and the weather when you ask about the weather; it is fairly easy to automate in whatever way you want.


Handling Obvious Transcription Errors

After upgrading to a Mixture of Experts model, I had to adjust my prompt in some areas, as it wasn't following instructions. While doing that, though, I discovered it was able to handle transcription errors pretty well; it knew that "Turn on the pan" meant "Turn on the fan".

I added the below to the prompt, which makes it clear when it thinks it misheard you and made an assumption. So far it has worked pretty well.

## Inferring Intent from Transcription Errors

Speech-to-text transcription is imperfect, and words are sometimes transcribed incorrectly. You may infer the intended meaning ONLY for device control requests and weather requests — not for general questions, search queries, or other request types.

Only infer intent when ALL of the following are true:
- The request is for device control or weather information.
- The input as transcribed does not make sense or refers to something that does not exist in context.
- There is an obvious phonetically similar alternative that would make the request clear and actionable.
- You are highly confident in the correction.

When you infer intent, you MUST begin your response with "I am assuming you meant" followed by the corrected interpretation, then proceed to fulfill the request. This gives the user an opportunity to correct you if the assumption is wrong.

If you are not highly confident, or if the request is not for device control or weather, ask for clarification or respond "Does not compute." instead of assuming.

Can I ask what model you are using now?

It has been great overall: definitely more intelligent, but it is more difficult to get it to follow directions with the old way I was doing things, presumably due to the MoE architecture.


Oh interesting. I see you mentioned that faster-whisper uses Parakeet by default now. I might just switch back to that, as I just followed @DrazorV's example above. One less thing on my GPU. @crzynik, would you be able to provide me with your whisper config?

This is mine currently, and I'm unsure if it's good or bad:

  whisper:
    container_name: whisper
    # linuxserver.io faster-whisper image; switch to the :gpu tag for a CUDA build
    image: lscr.io/linuxserver/faster-whisper #:gpu
    volumes:
      - ${ROOT}/config/whisper-data:/data
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=${TZ}
      - WHISPER_MODEL=large-v3 # larger models are more accurate but slower
      - WHISPER_LANG=en
      - WHISPER_BEAM=5 # beam search width; higher values trade speed for accuracy
    restart: unless-stopped
    ports:
      - 10300:10300 # Wyoming protocol port that Home Assistant connects to

One thing to watch, especially with LLMs, is that they like to follow patterns; there's a good chance it starts mimicking your instructions and responding with markdown titles.


Interesting. I'll definitely watch for that; I haven't seen any of that behavior thus far.

One good update that sneaked into HA at some point is that it seems to strip markdown before sending output to the TTS service. I still get the occasional bit of markdown from Qwen3VL, but I haven't heard it pronounced by a voice satellite for quite a few months now. Not sure at what point they slipped that in, but it was certainly a welcome change.
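
For anyone wondering what that kind of cleanup involves, here is a minimal sketch of the idea (emphatically not Home Assistant's actual implementation; strip_markdown_for_tts and the specific patterns are just my illustration): remove the common markdown syntax so the TTS engine reads only the plain text aloud.

    import re

    def strip_markdown_for_tts(text: str) -> str:
        # Hypothetical helper, not HA's code: drop common markdown so the TTS
        # engine doesn't read symbols like "#" or "**" out loud.
        text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)      # remove fenced code blocks
        text = re.sub(r"\*\*(.+?)\*\*|\*(.+?)\*",
                      lambda m: m.group(1) or m.group(2), text)      # unwrap bold/italic
        text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)   # strip heading markers
        text = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", r"\1", text)       # keep link text only
        return text.strip()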

Also, seeing your name here @TaterTotterson reminds me that I need to get around to using your microWakeWord trainer… I've trained an excellent custom Piper voice for my home Voice Assistant setup, but it really, REALLY needs a custom wake word to match.


@crzynik, I'm using your prompt and model (basically your exact setup here, with the same LLM integration) with "Prefer handling commands locally" set to true. When I ask my agent to turn on a specific light, it doesn't seem to be handled locally and always gets sent to my llama.cpp model agent. Is there any way to get it to actually prefer local handling? My setup is VACA/View Assist configured on an Echo Show 8. "Alexa, turn on the living room lights" seems to go to the llama.cpp agent every time, but when I start a conversation through HASS and type the same question, it is immediate and doesn't call llama.cpp. Super weird results, but any help would be great!

Not sure why that would be the case unless the wording wasn't quite right. That device is directly passed in as an Assist device, right? Is it the correct entity type?

If it only happens on VACA, make sure your conversation agent is set to the voice assistant and not directly to the llama.cpp agent; both are options.

Really wild. It looked like STT was taking around 10 seconds. I had changed from whisper to onnx asr and then back, and somehow faster-whisper was taking 10 seconds (I tried both CPU and GPU). I went back to onnx asr and it is sub-second.

I came across your video, which led me to your post; it looks like we're at a very similar stage. I'm currently in the process of moving from Ollama to llama.cpp for testing (so it sounds like I'm on the right track), and I may try some of the other suggestions you mentioned as well.
I noticed you have a Sat 1 Dev Kit; I have one too and am testing with it. I'd be interested to hear about your experience so far. The idea of reusing the Jabra was interesting, if you could share more details on how you've approached that.
Finally, I'm curious how many devices you've exposed to Voice so far.

I think you may be confusing me with someone else; I did not make a video.

But to answer what I can: I've been quite happy with the Satellite1 devices. They are pretty good for music (similar to the Google Nest Minis they replaced), and overall, with my custom wake word, they work really well.

I currently have 30 entities directly exposed, but that includes a few scripts which allow it to interact with more devices.