I’ve spent a couple of days muddling through this local AI stuff, with a twist.
I’m running my day-to-day HA on a Pi (yea, yea) and am implementing a voice-only HA on a Dell 9020 (HAOS, no virtualization).
For yuks, I’ve been seeing if Copilot can guide me through it. It’s all sorts of stuck on old techniques and UIs (Add-ons, Supervisor, Pipelines, etc.).
From what I understand so far:
- I need Whisper, Piper, Speech-to-Phrase, a wake-word (keyword) engine, Wyoming, Local LLM (from HACS), an HTTPS web browser, and a device with a mic/speaker (tablet, phone, ESP).
I may need more than one voice assistant (VA) for intent vs. conversation vs. xyz.
I have mine doing simple stuff like turn on lights, but “How many fans are on” gives me intent issues.
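For anyone wondering why the aggregate question breaks: template-style intent matching only fires on phrasings it has a pattern for. This toy sketch (NOT Home Assistant's real matcher - the templates and names here are made up for illustration) shows the gap:

```python
# Toy illustration (not HA's actual intent matcher): fixed sentence
# templates catch command phrases, but there is no pattern for an
# aggregate question like "how many fans are on".
import re

# Hypothetical templates in the spirit of Speech-to-Phrase / HassTurnOn.
TEMPLATES = {
    "turn_on":  re.compile(r"^turn on (the )?(?P<name>.+)$"),
    "turn_off": re.compile(r"^turn off (the )?(?P<name>.+)$"),
}

def match_intent(utterance: str):
    """Return (intent, entity name) or (None, None) if nothing matches."""
    text = utterance.lower().strip().rstrip("?.!")
    for intent, pattern in TEMPLATES.items():
        m = pattern.match(text)
        if m:
            return intent, m.group("name")
    return None, None  # no template fits -> "intent issue"

print(match_intent("Turn on the bonus room fan"))  # ('turn_on', 'bonus room fan')
print(match_intent("How many fans are on"))        # (None, None)
```

Counting queries need either a custom intent with its own template or an LLM conversation agent that can reason over exposed entities.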
My questions, as I head off to a couple of threads in here:
Can I run another GGUF somehow locally? I found one i want to try from HuggingFace but can’t figure out how to incorporate it.
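One common route (an assumption about your setup, not HA-specific advice): serve the GGUF with llama.cpp's OpenAI-compatible server (`llama-server -m ./your-model.gguf --port 8080`) and point a conversation integration at the endpoint. A minimal sketch of talking to such a server - the endpoint, port, and model name below are placeholders:

```python
# Sketch, assuming a llama.cpp server is already running your GGUF, e.g.:
#   llama-server -m ./your-model.gguf --port 8080
# "local-gguf" and the URL are placeholders, not HA configuration.
import json
import urllib.request

def build_chat_request(model: str, user_text: str) -> dict:
    """Payload for an OpenAI-compatible /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You control a smart home."},
            {"role": "user", "content": user_text},
        ],
        "temperature": 0.2,
    }

def ask(base_url: str, payload: dict) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires the server above to be running):
#   print(ask("http://localhost:8080",
#             build_chat_request("local-gguf", "How many fans are on?")))
```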
Is there a good (up-to-date) deployment guide I can go read?
First: what is your expectation? What experience, what endgame, do you want?
Reason: you have a huge sliding window of capabilities, but they also scale proportionally in price.
My Friday build currently clocks in at somewhere around $6,000 USD in gear. But you don’t need to spend that. Your expected result will help me set your expectations for what you need to get there.
Great framing, @NathanCu.
I’ve been very happy with my SmartThings legacy setups (3 homes/locations) for about 4 years with 200+ devices. I’ve been running HA for about 2 years now and doing more since the demise of ActionTiles, which I used on 4 Fire tablets - so I’m redoing some dashboards.
Along the way, I’ve decided to finally start investigating voice. The wife is adamant she doesn’t want Google/Amazon/the cloud listening in, so I’m looking to deploy local AI. I’ve got a bunch of hardware lying about: Pis, Dell 9020s, an HP Z800 with 16 cores and 96 GB RAM, gigabit networks, PoE most everywhere, etc.
My current roadmap is:
Keep SmartThings (so I don’t have to rejoin everything to HA)
Keep HA core functionality (cloud/integrations/dashboarding) on my Pi (for now)
Deploy a voice/AI server on “something” (Hyper-V, Proxmox, bare metal) and get that going. Yes, I realize GPUs make the experience better, etc.
So the core driver is to make it so we can say:
Hey Monkey, turn on the bonus room fan (even if we get the name a little wrong)
Hey Monkey, how many lights are on?
Hey Monkey, are any of the garage doors open?
Hey Monkey, turn off/on all the outdoor lights.
Nope - they’re 100% REQUIRED for local LLM. (Pin that - I have to explain what 100% local LLM entails.) You can have all the CPU you want (nice collection), but when you want LLM work you NEED a GPU. (NOT OPTIONAL)
So if you WANT local LLM inference, you’re buying. Skipping a lot of the analysis: for your standard McMansion, you’re going to need to expose somewhere between 30 and 200 entities. Short version: HA simply cannot maintain the entity registry plus all the tools and instructions required to make an LLM work reliably unless you use a modern model (let’s say something in the Qwen3-or-newer range) that supports at least 8K context (translation: at least an 8, probably 16 GiB VRAM GPU).
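To make the context pressure concrete, here’s a back-of-envelope calculation. The per-entity and prompt token costs are assumptions for illustration (real cost depends on names, aliases, and the tool schema HA sends):

```python
# Back-of-envelope only - TOKENS_PER_ENTITY and SYSTEM_PROMPT_TOKENS are
# assumed figures, not measured Home Assistant values.
TOKENS_PER_ENTITY = 25      # assumed: name + state + attributes per entity
SYSTEM_PROMPT_TOKENS = 800  # assumed: instructions + tool definitions
CONTEXT_WINDOW = 8_192      # the "at least 8K" floor mentioned above

def entity_budget(n_entities: int) -> int:
    """Tokens left for the actual conversation after the static payload."""
    return CONTEXT_WINDOW - SYSTEM_PROMPT_TOKENS - n_entities * TOKENS_PER_ENTITY

print(entity_budget(30))   # 6642 - comfortable
print(entity_budget(200))  # 2392 - getting tight for multi-turn chat
```

Under these assumptions, 200 exposed entities eat most of an 8K window before the conversation even starts - which is why a bigger context (and therefore more VRAM) helps.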
But Speech-to-Phrase is very different - it’s speech on guardrails with deterministic scripts.
OK - and the BIG question is: do you want to have to stay on specific phrasing? (see Speech-to-Phrase)
LLMs bring language comprehension and thinking.
If you want a glorious voice remote - you want Speech-to-Phrase to start.
BUT if you want an LLM agent - well… pull out the pocketbook…
When you replace Amazon or Google, you need to fundamentally replace:
Hardware: I use VPEs; there are a few options.
Wake word: you have two choices here. I use on-device microWakeWord; you can opt for server-driven.
ASR/STT: mine is Parakeet on a local server - you can use a cloud service, and HA provides one by default.
Inference: not required if you use Speech-to-Phrase. This can be a cloud service or a local service - if you use cloud, I STRONGLY recommend using a paid account and turning off all info-sharing toggles… If you want true OAI-type experiences here, you need to be able to run something like the new Qwen3.5 9b instruct AND a 35b thinking model simultaneously… or better - that’s a SIGNIFICANT card. (You don’t need all that to start, but that’s ChatGPT / new-Alexa levels of comprehension and action.) So this is the tier where you commit. I’m personally committed all the way to an NVIDIA Grace Blackwell GB10 DGX Spark / 128 GiB… Yes, it CAN run those simultaneously, and yes, it costs how much you think it does, but context space is no longer my concern.
TTS: Piper ships as an add-on (Whisper is the STT side), and there are cloud alternatives; I run Kokoro on the same server running STT.
Most of this is cheap - one component VERY MUCH is not, but you can sidestep that until you’re ready.
Yes, your local six-thousand-dollar Raspberry Pi, connected to your private nuclear-powered electricity substation over the back fence.
Your challenge - far greater than getting it working smoothly without consistently producing hallucinated slop - is convincing the wife the expenditure is necessary.
Like self driving electric cars, maybe wait till everything settles down a bit before taking the plunge?