As HA starts really playing a main role in my home, from controlling ventilation, heating and sunscreens to being a weather station, and I am probably missing some…
How do I survive a critical hardware failure?
Is there some survival guide to rebuild your system?
I approach it by virtualizing and doing backups every 12 hours. You can also use virtualization for high availability and replication. Zigbee and other radio protocols will be a single point of failure, so I keep backup hardware for those ready. Use network-based coordinators if you virtualize. Depending on your scope, you will probably have a lot of single points of failure in your home (network switch, internet router, internet connection, power supply), so it’s really up to how much money you want to throw at the different scenarios.
Manually, like you did before home automations.
Then you replace the failed component and dig out your fresh backups to set up the new hardware.
In case you don’t have backups, you give up, cut your veins or start building from the beginning…
Whatever method you use, make sure you understand it thoroughly, test the backup regularly and practice restore drills so that it’s not frustrating when you’re in emergency mode.
Make sure your backup is ON and backing up every day.
Make sure you HAVE your backup key if you leave encryption on.
Make sure you get that file OFF your HA box and somewhere else (if the machine dies, having a backup on the machine doesn’t help).
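The three checks above can be scripted. Here is a minimal sketch, assuming the backups land as `.tar` files in a local folder and that a second location (NAS mount, USB drive, whatever you have) is reachable as a normal path. The function names and the 24 hour limit are mine, not anything HA provides.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def newest_backup(backup_dir: Path) -> Optional[Path]:
    """Return the most recently modified .tar file in backup_dir, or None."""
    tars = [p for p in backup_dir.glob("*.tar") if p.is_file()]
    return max(tars, key=lambda p: p.stat().st_mtime, default=None)


def is_fresh(backup: Path, max_age_hours: float = 24.0,
             now: Optional[float] = None) -> bool:
    """True if the file was modified within the last max_age_hours."""
    now = time.time() if now is None else now
    return (now - backup.stat().st_mtime) <= max_age_hours * 3600


def copy_off_box(backup: Path, dest_dir: Path) -> Path:
    """Copy the backup to dest_dir and compare file sizes as a basic check."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / backup.name
    shutil.copy2(backup, target)
    if target.stat().st_size != backup.stat().st_size:
        raise OSError(f"size mismatch after copying {backup.name}")
    return target


def check_and_copy(backup_dir: Path, dest_dir: Path,
                   max_age_hours: float = 24.0) -> Path:
    """Fail loudly if there is no recent backup, otherwise copy it away."""
    latest = newest_backup(backup_dir)
    if latest is None:
        raise RuntimeError(f"no backup files found in {backup_dir}")
    if not is_fresh(latest, max_age_hours):
        raise RuntimeError(f"newest backup {latest.name} is too old")
    return copy_off_box(latest, dest_dir)


# Example call (both paths are hypothetical):
# check_and_copy(Path("/backup"), Path("/mnt/nas/ha-backups"))
```

Run it from cron or similar and have it notify you when it raises. A silent failure is exactly what you are trying to avoid.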
For me: once a quarter I pick one of my recent HA backups, spin up a HAOS VM based on the latest release, install it, create a bogus user, log in, connect to the backup location and restore.
If it works, great. If not… I have a problem. Fix it IMMEDIATELY.
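Between full restore drills, a quick automated sanity check can at least catch a backup file that is empty or not a readable archive. A minimal sketch, assuming the backup is a tar archive (I am not sure how far this gets you with encryption on, so treat it as a smoke test only). The function names are made up for this example, and it does not replace the real restore test described above.

```python
import tarfile
from pathlib import Path
from typing import List


def list_backup_members(backup: Path) -> List[str]:
    """Walk every header in the archive and return the member names.

    Raises tarfile.TarError if the file is not a readable tar archive.
    """
    with tarfile.open(backup, "r:*") as tar:
        return [member.name for member in tar.getmembers()]


def looks_restorable(backup: Path) -> bool:
    """True if the archive opens and contains at least one member.

    This only checks the archive structure, not that HA can restore it.
    """
    try:
        return len(list_backup_members(backup)) > 0
    except (tarfile.TarError, OSError):
        return False
```

If `looks_restorable` returns False for your newest backup, you already know you have a problem without waiting for the quarterly drill.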
If I may…
Start from the point of view that your system will suddenly die, as it statistically will, and that it will stay down for hours.
In my opinion you should start from the design phase: set up your house so that essential functions (heating, security, etc.) are available independently of your automations.
Then you can start thinking about backup/restore.
In my case I had the good fortune of finding 2 identical thin clients that I bought used, 60€ for both. I upgraded each with a cheap 128 GB SSD.
One is my primary system, backed up to an external drive daily. After the backup, an rsync job replicates my container data (I run everything on Docker) and the rebuild scripts to the secondary machine, and then the containers are spun up on the secondary system on a separate, isolated Docker network. This lets me test the environment without interruption, try out upgrades and so on…
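For anyone who wants to copy the idea, the replication step could look roughly like this. It is a sketch rather than my actual job: the paths and the host name are placeholders, and it assumes `rsync` is installed and SSH access to the secondary machine already works.

```python
import subprocess
from typing import List


def build_rsync_command(source_dir: str, remote: str,
                        dry_run: bool = False) -> List[str]:
    """Build an rsync command that mirrors source_dir to remote.

    -a        archive mode (recursive, keeps permissions and timestamps)
    --delete  removes files on the destination that are gone on the source
    """
    # A trailing slash makes rsync copy the directory contents,
    # not the directory itself.
    src = source_dir.rstrip("/") + "/"
    cmd = ["rsync", "-a", "--delete"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [src, remote]
    return cmd


def replicate(source_dir: str, remote: str, dry_run: bool = False) -> None:
    """Run the mirror and raise CalledProcessError if rsync fails."""
    subprocess.run(build_rsync_command(source_dir, remote, dry_run), check=True)


# Example call (hypothetical path and host name):
# replicate("/srv/docker-data", "standby:/srv/docker-data/", dry_run=True)
```

Because `--delete` makes the secondary an exact mirror, do a `dry_run=True` pass first and keep the daily external-drive backup as well. A mirror will happily replicate a mistake.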
If you are able, try to have a parallel system, even a cheap one, as a backup where you can tinker and experiment with restore procedures.
Personally I don’t use virtualization at home: it needs resources (RAM) that my devices lack and it may complicate things, but it does offer snapshots and easy “bare metal” backups.
Thinking about it, maybe I should split my HA into 2 installations:
1 for the stuff that is fixed and will not change, i.e. the actual house controls such as ventilation, sunscreens and heating.
I should just leave that system as is and not touch it, even stop HA Updates.
And one for the stuff that is fun and less critical, that can also be controlled in other ways, and where relevant HA updates are still coming. Like my EV and charger control, energy monitoring, TV control, dashboards…
So far, actual HA ‘crashes’ were due to software updates (always fixable, but still…).
I run a home automation system that I set up 20 years ago, mostly for lighting, and no longer touch. The only change I made in 20 years was to move the sunscreens to HA.
So part of the backup approach might be to move stuff to the ‘fixed’ system that has proper restore support and part to the fun system…
It’s a way, but it has its downsides: more complexity, some updates may still be needed (high-risk security fixes or new functionality), and managing 2 systems is almost double the work.
You may try to compromise: as I suggested, have 2 systems but use the second as a staging point and backup location. On that system, test the upgrades, check stability and so on.
Also make sure to have fallback systems.
For HVAC, for example, try to use physical thermostats that can be integrated: leave the day-to-day operation to the thermostat and use HA to dynamically change the settings/setpoints.
If HA fails, you simply walk to the thermostat and change the settings there.
Same for lights: always have a physical button or switch to turn them on and off as needed, etc.