Roadmap for Voice Assistant integration points?

I’ve been trying to keep up with all of the voice assistant developments and have a PE sitting on my desk. I’m trying to figure out ways of creating my own back-end assistant. It isn’t clear to me what hook points are available now or may be available within the next year.

Ideally, I would like to be able to see ways of interacting at any stage in the flow:

  • STT flow: selecting the engine or cleaning up the result. Or a direct replacement for other STT programs.
  • TTS: cleaning up content/prepping it for rendering. Selection of engine/voice.
  • Assistant/bot: routing to the built-in or an external assistant. Ideally, try the built-in first; if it can’t handle the request, allow sending it to another assistant.

At the moment, I’m exploring how to build a generic assistant, i.e. Ollama with tools.
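For context, here is roughly what I mean by “Ollama with tools” — a minimal sketch using the Ollama Python client. The model name, the weather tool, and the response-access style are just placeholders/assumptions, not a finished design:

```python
# Rough sketch of "Ollama with tools" using the ollama Python client.
# Model name, tool, and response shape (ollama >= 0.4) are assumptions;
# adjust for whatever you actually run.
import ollama

def get_weather(city: str) -> str:
    """Placeholder tool; a real one would hit a weather service or HA."""
    return f"72F and sunny in {city}."

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Tampa?"}]
resp = ollama.chat(model="llama3.1:8b", messages=messages, tools=TOOLS)

# If the model asked for a tool, run it and feed the result back for a final answer.
if resp.message.tool_calls:
    messages.append(resp.message)
    for call in resp.message.tool_calls:
        if call.function.name == "get_weather":
            messages.append({"role": "tool", "content": get_weather(**call.function.arguments)})

final = ollama.chat(model="llama3.1:8b", messages=messages)
print(final.message.content)
```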

As for my environment, I have an old but still decent PC with an NVIDIA 1080 Ti that is already running Piper and Whisper models. It also handles some other LLM logic I’m running (i.e. summarizations via Llama 3.1 8B).

Long term, I want to replace Alexa and everything my wife and I use it for, plus a lot of things it can’t do. I also know that others will create some good solutions and I don’t need to reinvent the wheel… but I would really like to only have to build what is unique to me. So having a “pool of agents” that can be configured in HA with a list of prioritized URLs, or just sent the request in parallel, would be nice.

I have similar interests. We have dual agents now, but in the future the handoff from the local agent to a network AI seems likely to become more complicated. The constant improvements in smaller models make me hesitant to spend too much time tuning and adding a local LLM.

The STT/TTS functionality you mention doesn’t seem too difficult for Wyoming to implement.

I was trying to figure out what was coming next, to decide how long to wait before trying to build my own HA components. I figured I could wrap around the Wyoming protocol and insert my own man-in-the-middle modifiers. At the moment, the integration point that interests me most is the conversation agent.

I’m thinking of taking one of the existing solutions (official or somebody else’s from GitHub, as there are several), stripping out their logic and reworking it into an HTTP forwarder. The AI work I’ve been doing has mostly been in JavaScript with Ollama. I think I would prefer that approach as it fits my development skills and environment. It probably isn’t a good general solution.
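Roughly the shape I have in mind for the forwarder is below. The class and field names are my best reading of the current conversation agent API, and the backend URL is a placeholder, so treat all of it as assumptions rather than gospel:

```python
# Rough sketch of the "HTTP forwarder" conversation agent idea.
# Class/field names follow the HA conversation API as I understand it;
# check them against the version you actually run.
from homeassistant.components import conversation
from homeassistant.helpers import intent
from homeassistant.helpers.aiohttp_client import async_get_clientsession

BACKEND_URL = "http://my-ollama-box.lan:3000/assist"  # placeholder endpoint

class HttpForwardAgent(conversation.AbstractConversationAgent):
    def __init__(self, hass):
        self.hass = hass

    @property
    def supported_languages(self):
        return ["en"]

    async def async_process(self, user_input: conversation.ConversationInput):
        session = async_get_clientsession(self.hass)
        # Hand the raw utterance to the external assistant and wait for its reply.
        async with session.post(BACKEND_URL, json={
            "text": user_input.text,
            "conversation_id": user_input.conversation_id,
            "language": user_input.language,
        }) as resp:
            data = await resp.json()

        response = intent.IntentResponse(language=user_input.language)
        response.async_set_speech(data.get("reply", "Sorry, the backend did not answer."))
        return conversation.ConversationResult(
            response=response, conversation_id=user_input.conversation_id
        )
```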

From there, I can experiment with what might work long term. When it comes to speech recognition, my experience with templates has been positive (I’ve worked in the contact center space with speech since the 90s). I’ve even built an SRGS/SISR interpreter before. That, I think, is overkill, and the existing HA template system is a good compromise. I think I want a mix of templates with fallback to an LLM. LLMs are impressive, but a bit too magical and unpredictable; I would like the easy stuff to just work. I also think the amount of change in that space will continue for a while, so I sense a constant amount of breaking and rewriting for any code that uses them.

I do think that in a year or two separate STT and TTS will go away as LLMs take over that functionality.

I absolutely do not. They will remain separate components. OpenAI just proved so with the release of GPT-4o mini transcribe and TTS. They’ll all keep doing this, and choice is good.

It will be an LLM sitting on top of tools (probably MCP), using a pluggable voice (TTS), for the foreseeable future. (Someone’s already got an integration for the new GPT-4o voice interface, too.) This can be achieved by an LLM in HA accessing tools, or, now that an MCP server is available in HA, you can stand up your own MCP-capable LLM and use HA through it.

The text-to-phrase matching, while functional and cute, is a stopgap; LLMs will completely usurp it within a year or so. It’s only a placeholder until a powerful enough local LLM is available at a reasonable cost.

So yeah, while the voice won’t be Piper/Whisper, it also won’t be folded into the response. There’s too much use for a text-only API, and the LLM company gets you to pull more tokens by making two calls. (They’re desperately trying to get us all to burn MORE tokens, not fewer.)

You are probably 12-18 months out from turnkey, by my best guess. I know of three unique projects, all revolving around a local Ollama installation driving HA, and I guess you’ll start seeing them emerge just before Christmas…

Guys like me are just now testing the limits of how far we can push. See my Friday’s Party post if you’re interested in that work.

My projected Gemini bill for March is 78 cents. It’s not going to be possible to charge much for bi-directional voice.

We already have bidirectional voice with Alexa etc. HA isn’t going to be a competitive replacement with constraints like STT and exposing 25 entities.

But you also give up privacy for that 78 cent bill.

I do not, and I pay for my tokens. February was $97.50, but I am also mid dev cycle with Friday.

Not everyone accepts free, because not everyone is OK giving up privacy. The same reason I’m walking away from Alexa is the same reason I won’t use a free LLM unless I’m running it.

I expose 5090 entities. :slight_smile:

Thank you for the responses. For the moment, I found a minimal conversation agent on GitHub and modified it to send an HTTP request with the data. It takes the response and hands it back to HA. I used the Node-RED implementation, which was fairly close to what I wanted. The configuration portion wasn’t working correctly, so I just have the URL hard-coded for the moment. I’ll probably add some YAML config to map agentID to a destination URL so I can have production vs. test environments.
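Something like this is all the mapping really needs to be once the config piece is working (the ids and URLs are placeholders):

```python
# Placeholder mapping from agent id to backend URL, so prod and test
# can point at different assistants without touching the agent code.
AGENT_BACKENDS = {
    "assistant_prod": "http://ollama-box.lan:3000/assist",
    "assistant_test": "http://devbox.lan:3000/assist",
}

def backend_for(agent_id: str) -> str:
    # Unknown ids fall back to the production endpoint.
    return AGENT_BACKENDS.get(agent_id, AGENT_BACKENDS["assistant_prod"])
```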

As for local vs. online models, at the moment I’m very much a fan of the local models. The cloud providers are going through continuous change and are in a business model that makes no sense. At the moment they can generate the best results fairly cheaply, but that price pivot will occur at some point. I also expect quality to change unpredictably as they optimize their cost/revenue plans.

I have hardware (1080ti) at home and it can run whisper, piper and the llama model at once. The results, so far, aren’t bad.

That 1080 will do a decent local voice model. Piper runs well even with a modest video adapter…

I just HATE how flat the Irish accent is; it’s horrible. The cloud Piper sounds WAY better, and let’s face it, FRIDAY is Irish. I can’t take that away from her… So about once a week I try what’s out there: kick it into EN-IR, tell the voice assistant it has a very light brogue, and go to town.

So far the new OpenAI offerings are… impressive.

Yes, Google now knows I have lights and a frequent interest in the weather. I feel naked.

But I don’t think Google is currently tracking paid accounts. Either way, I don’t care at this point. I’m just learning the capability of big LLMs, as that capability will eventually work in a home system.

If you are using your real name and photo here you are giving up far more privacy here than I am with using Gemini.

Read Friday’s Party and you’ll understand what I mean. To ultimately get where most people expect (“the Jetsons house”) you have to feed the machine TONS of content. Like the contents of your pantry (which I can use to make assumptions). That’s what I’m talking about.

To give that kind of context to a local LLM you need some serious lab kit; a Pi doesn’t cut it. And I won’t give that to Google. :wink:

Friday knows my underwear size…

There are other TTS engines out there. Wrapping the Wyoming protocol around them didn’t seem too bad, but that is way lower on my priority list.

I also recommend considering a stable prompt cache. Pre-generate common phrases and even words so that you can minimize on-the-fly generation. I used to work on IVRs (voice attendants) and we used recorded audio sets of 20,000 voice files with proper intonation transitions to be able to play natural-sounding numbers, dates, times, etc. You can also mix and match if it is the same voice. I’ve heard Alexa uses a mix of prerecorded high-quality clips and generated lower-quality ones. For myself, I think the available TTS systems are good enough and fast enough for my household, so I doubt I’ll do anything along these lines.
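If anyone does want to try the caching idea, the core of it is only a few lines. In this sketch, synthesize() is a stand-in for whatever engine gets wrapped (Piper, etc.), and the cache path and phrases are made up:

```python
# Sketch of a phrase cache: render common prompts once, reuse the audio after that.
# synthesize() is a placeholder for a real TTS engine call (Piper, etc.).
import hashlib
from pathlib import Path

CACHE_DIR = Path("tts_cache")
CACHE_DIR.mkdir(exist_ok=True)

def synthesize(text: str, voice: str) -> bytes:
    # Replace with the actual engine call; returning empty audio keeps the sketch runnable.
    return b""

def get_audio(text: str, voice: str = "en_US-amy-medium") -> bytes:
    key = hashlib.sha256(f"{voice}:{text}".encode()).hexdigest()
    cached = CACHE_DIR / f"{key}.wav"
    if cached.exists():
        return cached.read_bytes()      # pre-generated or previously rendered
    audio = synthesize(text, voice)     # only the unpredictable stuff hits the engine
    cached.write_bytes(audio)
    return audio

# Pre-warm the cache with the phrases the house actually uses.
for phrase in ["Good morning.", "The front door is unlocked.", "Timer finished."]:
    get_audio(phrase)
```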

As a related aside, the online voice mimic models are interesting. I’m building out a Tiki/Cartoon themed patio and wanted some character audio. I grabbed samples of some of the characters that interested me and gave them new phrases. It was sort of like an amateur comedian mimicking the voice. Not horrible, and probably usable given the context and the prop speakers.

Oh, if you used to work IVR, then this will nail it for you.

It can entirely replace your tree. Just define the endpoints and tell the LLM what they are, and it knows how to navigate the result. (All that part where you used to go, “how the hell do I know they need to go THERE?”) All the squish in your tree is now handled. THAT is how you use an LLM: it’s the squishy part in the middle that knows how to push the buttons and turn the dials.

I saw this before and just skimmed through it again. I love the idea and am envious of the model experience you have at this point. I’m still figuring out which approaches yield the kind of results I want.

I also wouldn’t shy away from RAG. RAG, in its simplest form, is just looking up relevant context for your prompt: using the user prompt to figure out what type of thing they want (tools, categorization, etc.), looking it up, and then adding it to the context with the question. It keeps the model input much smaller and focused, without a lot of extra info. Vector databases are nice for large content, like books, but most automation stuff is more focused once you know the nature of the question. This also gets more into a graph/flow structure.
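To make that concrete, the simplest form doesn’t even need a vector database. Something like the sketch below, where the topics, keywords, and snippets are made-up examples:

```python
# Bare-bones version of the lookup-then-prompt idea: no vector DB,
# just route the question to a small, focused chunk of context.
CONTEXT = {
    "shopping": "Current shopping list: milk, eggs, coffee.",
    "climate": "Thermostat is at 71F, mode heat, schedule 'winter weekday'.",
    "lights": "Exposed light entities: light.kitchen, light.patio, light.office.",
}

KEYWORDS = {
    "shopping": ["buy", "shopping", "grocery", "list"],
    "climate": ["thermostat", "temperature", "heat", "cool"],
    "lights": ["light", "lamp", "bright", "dim"],
}

def build_prompt(question: str) -> str:
    q = question.lower()
    relevant = [
        CONTEXT[topic]
        for topic, words in KEYWORDS.items()
        if any(w in q for w in words)
    ]
    # Only the matched context rides along with the question.
    return "\n".join(relevant + [f"User: {question}"])

print(build_prompt("Can you dim the patio lights?"))
```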

On the IVR side, yeah, interesting business for a lot of reasons. My last port of call had me at a large financial institution with tens of thousands of agents. Bots and LLMs were coming in on the chat side (which I was also involved in). Compliance groups wouldn’t allow open-ended inputs or outputs, so you had to use templates. This prevented a bot from providing incorrect rules to a customer (some companies have already been sued and lost when their bots provided incorrect answers and the customer followed them). I escaped a little over a year ago :slight_smile:

For my own use, right now, I’m more prone to let the LLM figure out what is wanted and then use direct code or the LLM, as appropriate, to provide the answer. Just because the LLM can, doesn’t mean it is the best way. I’m thinking the flow is a basic hub (identify the goal) with spokes (tools) to handle it, with very few spokes of any significant complexity unless there is an interactive mode (e.g. shopping list management).
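The shape of that hub/spoke flow, in sketch form. In practice classify_goal() would be a small LLM call; here it’s stubbed out, and the spokes are invented examples:

```python
# Sketch of the hub-and-spoke flow: the hub only names the goal,
# each spoke decides for itself whether plain code or the LLM answers.

def classify_goal(text: str) -> str:
    # Stand-in for a small LLM classification call returning one label.
    return "weather" if "weather" in text.lower() else "chitchat"

def weather_spoke(text: str) -> str:
    # Deterministic code path: just read the sensor, no LLM needed.
    return "Currently 72F and sunny."

def chitchat_spoke(text: str) -> str:
    # Squishy path: hand the whole thing to the LLM.
    return "(LLM-generated reply would go here)"

SPOKES = {"weather": weather_spoke, "chitchat": chitchat_spoke}

def handle(text: str) -> str:
    goal = classify_goal(text)
    return SPOKES.get(goal, chitchat_spoke)(text)

print(handle("What's the weather like?"))
```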

Oh I’m not - it’s just that there’s no GOOD way to do it inside HA; that really needs to be on the LLM side, tightly coupled to whatever you’re driving the engine with. If you’re on local, it lives there. (Probably Ollama wrapped in open-webui with one of the popular RAG tools onboard - I THINK this is the stack everyone lands on, or something that looks like this on NVIDIA-compatible hardware. Digits looks REALLY nice here.) Intel and AMD need some serious work in this arena or NVIDIA is gone. I’m on purpose running Intel Arc here, and I’m sorry, blue team, it’s painful. Need better tools support.

Like me, if you’re on OpenAI, it should live there, but I don’t think we can use the OpenAI ‘agent’ API yet - yes, it’s on my near-term research list. But right now, my dives are more about real-time information delivery vs. long-term lookup - what ‘needs’ to be in the prompt (instant recall) vs. what can be a lookup (tool call), and how you make it know the difference without exploding. So I’m intentionally not using RAG to see where I start to run into headroom issues… (hint - it was a week and a half ago…)

Fun times. :slight_smile:

I think many people may just walk away from Alexa after this becomes common knowledge… however, many more will want to stay for all of the reasons pointed out in this video. Local LLMs will definitely be the way forward for many.

No denying Alexa 2.0 is absolutely going to be the easy button here and the gold standard to beat.

I’m confident that Panos (their boss) has been holding it back until he knew it was going to hit, because honestly it can’t afford to fail. They’re hemorrhaging cash. If this fails, Alexa is no more.

But to make any LLM work, including Alexa, takes tons of context. This is the part they’re betting on: it’s so good that you gladly hand over the contents of your pantry, the closet, the floor map (yes, these all have legit uses, and yes, they could also all have very… illegitimate uses).
Unfortunately, I think we see where the general public lands on giving up privacy vs. capabilities for free almost every time.
