Over the past few weeks I keep seeing a lot of buzz about Googles Gemma 4 models capabilities and how it can analyse voice and images.
Unfortunately I’m quite out of the loop in terms options on the HA side for voice assistant outside of the builtin pipelines.
I was wondering, given Gemma 4 can take audio directly has anyone looked at some way to send the audio straight to the ‘conversation agent’ to cut out the middle man (currently using whisper) - I am assuming this would be better from a response time perspective, and instinctively (maybe wrong?) I’d imagine the LLM would be able to using reasoning to assume a word it may mishear in a noisy environment ?
Interested to hear what people would recommend for a basic usable VA setup utilising an A2000 12GB. Ideally looking to replicate something similar to the Alexa / Google Home experience where it has basic knowledge to answer queries and control the house & music.
Yes some new models are multimedia aware and can TTS or STT. Conversation agents and voice components need updates to use them.
New components show up every day so I’m sure someone is working on components for the models that are capable of this.
That said, it will be no faster or slower than a local voice component running whisper or in my case onnx/parakeet. (it’s subsecod response) easier… Maybe? But at that point of config you’re just pointing at a url question is if you need to setup voice as a separate component in future, capacity planning will be the saving IF you like the voices they offer.
For current however. You need both speech and model.
In the interim I will look at setting up the separate components to handle TTS / STT.
Is Ollama still the recommended approach from running the LLM models locally (Linux / Docker) ? - I’ve not looked at running any of this locally for about a year or so and it seems like there are many different options now
Also, out of interest why parakeet ? - Faster, more accurate ?
Way faster. 0.5 or less second response CPU only. For stt voices three are many options I prefer qwenTTS to any of the piper stuff. Piper works but let’s just say they could sound better…
Ollama is popular but not the fastest you will have to experiment. There are multiple engines including llama.cpp, vllm, ollama, lm studio. Etc find the one that fits your use. THAT is a function of your hardware…