So today, after months of tolerating Amazon’s terrible software, I literally smashed up my Echo Show and put it in the bin! Seriously, if something is making me do that, it’s time to do something different. FU Bezos, you low-rent vampire.
I’ve also finally got around to updating my home server to Linux and am now considering getting Home Assistant up and running. I have an Intel Arc A380 installed in it, so I think I can do some basic local AI for the STT and TTS?
I’ve seen some vids about getting voice assistants running on HA and am interested in making my own units. I’m guessing this is a well-trodden path and rather than start from scratch, thought I’d come to this community for help.
I know there are commercial products you can buy now for HA, but I’m really interested in building units myself. Here’s what I’d like to achieve:
Compact unit no wider than around 20cm.
Good quality stereo sound including good bass response
Cheap, 7" display for feedback - forget touchscreen
Reasonably responsive controller - not too laggy. Wireless and bluetooth functionality.
As cheap as possible! Preferably under £40. Aiming for £30
3d printable case
I know this is possibly wishing pigs to fly, but I’m thinking Raspberry Pi Zero 2 or the ESP range (about which I know nothing).
So, once you’ve stopped laughing, any thoughts on any part of this and how to achieve it?
You could technically build your own Home Assistant Voice Preview Edition. The Nabu Casa folks have provided all the design files for the case, the circuit layout, and the custom PCB. You’d have to get the PCB fabricated somewhere and then source all the parts, but it is possible:
I’ll just note that that device, fully assembled and shipped in a pretty box, is already in your price range. I suspect you wouldn’t be able to build it yourself for less, but you could probably do it for slightly more (Nabu Casa gets volume discounts on components that you won’t).
Your biggest issue will be hardware, especially at that price point.
A 7-inch display alone is going to eat most of that budget, and if you want touch, you’re already over. You will then need an ESP of some sort to drive the display (more cost). You will also need a microphone, or three, to get good wake-word response, then a stereo DAC and some decent speakers.
You will then run into memory issues if you drive a display from an ESP32 and expect it to cope with voice and audio as well, so you will probably need more than one ESP32.
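To put a rough number on the display side of that memory problem, here’s a back-of-envelope sketch (my own figures, assuming a common 800×480 7-inch panel and 16-bit RGB565 colour, which is a typical setup rather than anything specific to a given board):

```python
def framebuffer_kib(width: int, height: int, bytes_per_pixel: int) -> float:
    """Size of one full-screen framebuffer in KiB."""
    return width * height * bytes_per_pixel / 1024

# Typical 7" panel at 800x480 with RGB565 colour (2 bytes per pixel):
fb = framebuffer_kib(800, 480, 2)   # 750 KiB for a single buffer
ESP32_SRAM_KIB = 520                # roughly the classic ESP32's total on-chip SRAM

print(f"one framebuffer: {fb:.0f} KiB vs ~{ESP32_SRAM_KIB} KiB total SRAM")
```

A single framebuffer already exceeds the chip’s entire SRAM before you’ve buffered a byte of audio, which is why boards pairing an ESP32 with a large panel lean on external PSRAM, and why splitting display and voice across two chips is a sane design.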
Next will be the massive learning curve to write this code. Good luck and please publish any hardware and software you create as we will be interested and no doubt able to help.
And if you end up following videos, please check that they are only a few months old: ESPHome’s audio support has been, and still is, changing almost weekly, and it’s far from a finished thing yet. Far better to search this forum, as the latest stuff will be here.
I have numerous home-made voice and audio projects around the house, in use all day most days. I also have a touchscreen display on my desk for interacting with HA. All is possible, but rather more expensive than you are hoping for.
Really? I may be misunderstanding things, but would local AI to interpret and reply to voice commands/queries really take that much power at, say, 30 interactions a day?
To run any kind of local AI that actually responds within a few seconds will require about £2k worth of hardware, and even then I would expect it to be only average at best. You need a very good, latest-spec GPU, a good-spec CPU and lots of RAM.
If you want to run a small model, you may be able to get away with less, but the experience will be poor at best.
Try installing ollama on whatever PC you have and see what that can do.
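If you’d rather poke at it from code than the CLI, here’s a minimal sketch against ollama’s local REST API (this assumes the ollama daemon is running on its default port 11434 and that you’ve pulled a model; `llama3.2` below is just an example name, swap in whatever you pulled):

```python
import json
import urllib.request

# ollama's default local generation endpoint
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """Build the JSON body for a single, non-streaming generation call."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask(model: str, prompt: str) -> str:
    """Send a prompt to a locally running ollama daemon and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Live call: needs the daemon running and a model pulled, e.g. `ollama pull llama3.2`
# print(ask("llama3.2", "Turn off the kitchen lights, please."))
```

Time how long that call takes on your CPU versus what you’d tolerate from a voice assistant; that gap is the point the posters above are making.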
Yes. You are very much misunderstanding what’s required to deliver a voice assistant experience.
Local AI capable of doing what you ask (sufficiently replacing Amazon’s crap software) requires an LLM with sufficient parameters. 4b is the absolute minimum, but when you talk to GPT or Alexa you’re using approximately a 120-300b parameter model…
Those models REQUIRE a GPU. And not just any: one that can support a large context window (lots of VRAM). I’m running a 5070 Ti (16 GB VRAM) AND an Intel A770, and together they’re not enough to run a 1:1 Alexa experience completely offline. (The best I can do is a 20b model at 64k context.)
The rig that runs that idles at ~35 W and bursts to ~400 W for a few seconds at a time.
No. A CPU, a Pi, etc. cannot and won’t do it. You need beef with VRAM. Remember, for LLM voice like Amazon’s you’re replacing their entire DATA CENTER… It’s not just a device…
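To put rough numbers behind “beef with VRAM”, here’s a back-of-envelope sketch (my own simplified arithmetic, not an official sizing formula; the 48-layer, 8-KV-head architecture is made up for illustration) of why a 20b model at 64k context roughly fills two 16 GB cards:

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate VRAM for the model weights alone (1e9 params * bytes ~= GB)."""
    return params_billion * bits_per_param / 8

def kv_cache_gb(n_layers: int, context_len: int, n_kv_heads: int,
                head_dim: int, bytes_per_val: int = 2) -> float:
    """Approximate KV-cache size: a key and a value per layer, head and token."""
    return 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_val / 1e9

# A 20b-parameter model at 4-bit quantisation:
w = weights_gb(20, 4)                    # about 10 GB just for the weights
# Illustrative architecture: 48 layers, 8 KV heads of dim 128, fp16 cache, 64k tokens:
kv = kv_cache_gb(48, 65536, 8, 128, 2)   # about 13 GB for the cache

print(f"weights ~{w:.1f} GB + KV cache ~{kv:.1f} GB = ~{w + kv:.1f} GB")
```

That’s ~23 GB before activations and runtime overhead, i.e. already more than any single consumer 16 GB card, and the KV-cache term grows linearly with context length, which is why “lots of VRAM” and “large context” go together.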
The ‘vampire’ kept their costs small by subsidizing your equipment cost: capturing your data in their ecosystem and selling it, while keeping operating costs as low as possible. (And it matters. You should see what I can tell about a home if I can see its power use… By circuit, I can probably tell you what was happening…) If you don’t want to be in their ecosystem because of that, fine (neither do I). It’s not a nuclear reactor… but it’s also NOT cheap. Heavy upfront.
You’re buying gear. It’s not about how many interactions; each one requires… that (*points at the rig above*).
Your alternative is a paid cloud model, but as you’ve already said, privacy is important… You can also do speech-to-phrase (which eliminates the high GPU cost), but that’s not an LLM, it’s voice control. (Think old Alexa: commands must match specific phrases exactly.)
Then, for the voice endpoints themselves: I bought VPEs because, quite simply, you’re not building them, or anything remotely equivalent, for less than $50 USD each. And at that price the speakers are trash compared to any Echo, and there’s no screen. I connect mine to external speakers, so it’s not an issue.
All this to say: hey man, I get you wanting out of Amazon. Smashing gear may have been a bit premature… Reset your expectations, and good luck with the build.
THIS is how I got out of Amazon! As soon as the Voice PEs were available, I got rid of all our Echos. For me, “speech to phrase” is perfectly acceptable for now. Works great for our simple needs. I also use external speakers with them so that helps, especially with music.
I AM interested in LLMs… in the future. For now, two years in, I’m still learning my way around HA… currently tackling templates. Advanced for me, but that’s off-topic here.
Anyway, point being, as mentioned by @IOT7712 “Start small, walk before you run.”
Sidenote: @NathanCu LOVE your Friday posts! Do I understand them? Very little, but I WILL one day!
One point to add:
If you’re willing to wait for the reaction/response of your LLM for up to 1/30th of a day (which would be a latency of about 48 minutes), you might be able to end up with cheaper hardware (still not a Pi, because it won’t be able to load and execute these models at all).
But I guess you want the same fast responses as, say, someone who asks his LLM questions about 500 times a day.
So no, you won’t be able to save costs on the hardware compared to a more intense use.
You just might save on energy costs, as the system would sit idle more often.