Gemma 4 for Offline HA Assistant

swifty · April 27, 2026, 10:31am

Over the past few weeks I keep seeing a lot of buzz about Googles Gemma 4 models capabilities and how it can analyse voice and images.
Unfortunately I’m quite out of the loop in terms options on the HA side for voice assistant outside of the builtin pipelines.

I was wondering, given Gemma 4 can take audio directly has anyone looked at some way to send the audio straight to the ‘conversation agent’ to cut out the middle man (currently using whisper) - I am assuming this would be better from a response time perspective, and instinctively (maybe wrong?) I’d imagine the LLM would be able to using reasoning to assume a word it may mishear in a noisy environment ?

Interested to hear what people would recommend for a basic usable VA setup utilising an A2000 12GB. Ideally looking to replicate something similar to the Alexa / Google Home experience where it has basic knowledge to answer queries and control the house & music.

NathanCu · April 27, 2026, 10:38am

Yes some new models are multimedia aware and can TTS or STT. Conversation agents and voice components need updates to use them.

New components show up every day so I’m sure someone is working on components for the models that are capable of this.

That said, it will be no faster or slower than a local voice component running whisper or in my case onnx/parakeet. (it’s subsecod response) easier… Maybe? But at that point of config you’re just pointing at a url question is if you need to setup voice as a separate component in future, capacity planning will be the saving IF you like the voices they offer.

For current however. You need both speech and model.

swifty · April 27, 2026, 11:43am

Thanks for the info.

In the interim I will look at setting up the separate components to handle TTS / STT.
Is Ollama still the recommended approach from running the LLM models locally (Linux / Docker) ? - I’ve not looked at running any of this locally for about a year or so and it seems like there are many different options now

Also, out of interest why parakeet ? - Faster, more accurate ?

NathanCu · April 27, 2026, 11:49am

Way faster. 0.5 or less second response CPU only. For stt voices three are many options I prefer qwenTTS to any of the piper stuff. Piper works but let’s just say they could sound better…

Ollama is popular but not the fastest you will have to experiment. There are multiple engines including llama.cpp, vllm, ollama, lm studio. Etc find the one that fits your use. THAT is a function of your hardware…

I use vllm and qwen mostly.

swifty · April 27, 2026, 11:54am

Awesome thanks for the tips, those will be a great starting point so I’ll check those out