Host System Health Discussion

Just had my first system loss!! Yay. I feel it’s a badge of honour all of us HA’ers earn when we catastrophically lose our installation, usually when we’re already overwhelmed and stressed with little time to fix it! But heigh-ho, all is fixed. My system has been cured.

So what is better than a cure? . . . prevention, of course. My failure was the HDD: a new SSD I bought 3 months ago. All was going fine and fast, but recently I wanted to start setting up some ESPHome devices (on ESP32s) and thought: you know what, it’s about time I did a Clonezilla image. I have snapshots, but I wanted a full image for easy recovery if I needed it, and then to start doing it weekly. So I shut her down ready for cloning . . . then saw some ominous messages about not being able to write back standard Debian data, then more messages, then more serious ones, and then the shutdown failed! My supervised HA on Buster had crashed halfway through a shutdown. I held the power switch off and then on again and prayed.

. . . Like the Titanic, that disk and the DB on it are now lost, and the souls (bits) aboard her have gone too. No amount of rescue disks could recover the filesystem. The disk cannot even be formatted!!

So, a new disk has gone in, the system has been rebuilt and a snapshot restored. Great. However, all history is lost :frowning:

Now for the open discussion: prevention & disaster recovery.

First, prevention: how do you monitor your system’s health? I don’t mean the System Health integration, as that would not have caught my errors, but real, deep monitoring like Synology or OMV5 has. Seems a bit odd that HA, a home monitoring specialist, doesn’t, first and foremost, monitor itself?! Or maybe it does and I missed the warning signs?

Second, DR. HA collected a season of temperature data from every room in my house last winter, ready for me to build some smart logic and install new TRVs this winter. Sadly, all the history I have now is . . . 24 hrs :frowning:

So I have learnt the hard way that snapshots are only config data. Even a FULL snapshot is not like a VM full snapshot. Nothing like it at all.

So what do you do to back up your entire system? I was going to Clonezilla mine, but I’m not really a fan of taking it offline while I do. I’m thinking about using Linux dd to clone an image to my OMV5. Maybe there is already something out there that makes it even easier, like: insert USB and press go; if the system fails, insert/boot from that USB and restore.
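
In case it’s useful, here is a minimal sketch of the dd-over-SSH idea (the hostname, device and target paths are placeholders for my setup, and the disk should be offline/unmounted when imaged to get a consistent copy):

#!/bin/bash
# Stream a compressed raw image of the whole HA disk to the NAS over SSH.
# /dev/sda, the user and the destination path are examples only.
dd if=/dev/sda bs=4M status=progress | gzip | \
  ssh backup@omv5 "cat > /backups/ha-disk-$(date +%F).img.gz"
# Restore is the reverse, booted from a live USB:
#   ssh backup@omv5 "cat /backups/ha-disk-<date>.img.gz" | gunzip | dd of=/dev/sda bs=4M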

Would love to hear everyone’s thoughts and suggestions. Maybe a good podcast topic?

I haven’t yet found a great way to clone the full drive data automatically, but the way I handle disaster recovery right now is:

  • I have an exact duplicate of the Pi that I use for other, less mission-critical things, so if there is a hardware failure it’s plug-and-play
  • I schedule home network maintenance once per month across all my network gear, and this includes a second drive with HA installed on it that I update during that maintenance, so if there’s a failure I’m only a couple of updates away from where my live system is now
  • Along with daily snapshots, I also automatically copy the snapshot up to another system so I have an offline copy (see the sketch after this list)
  • I have a script I wrote that stops HA early every morning and takes a clean copy of the database (having HA running during a snapshot most often leaves the DB unusable on restore), and that gets pushed to another server with my daily snapshot. This has the added benefit of letting me find any critical processes that don’t retain their state on restarts and come up with better ways to keep them persistent (I use the variables extension for that)
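
For the off-box snapshot copy in the third bullet, the job is essentially one rsync (the hostname and paths here are assumptions; on a supervised install the snapshots live in /backup):

#!/bin/bash
# Nightly: mirror the local snapshot directory to another server.
# backup@nas.local and the destination path are examples only.
rsync -av /backup/ backup@nas.local:/srv/ha-snapshots/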

Fortunately I haven’t earned my “total HA failure scout badge” but I have had bad updates that required me to restore a snapshot and found out the hard way that database restores don’t work. I’m not as wigged out about losing history as some folks but it would be a shame to lose it. Even if you don’t want to script the stopping of HA to get a clean DB backup it’s not a bad idea to just SSH into HA every now and then to do that process manually so you can minimize any possible history loss.

I saw a drive SMART data sensor on the forum somewhere… I should implement monitoring that.
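
Until then, a quick manual check from the shell does the job, assuming smartmontools is installed (the device path is an example):

# Overall health verdict (PASSED/FAILED) for the drive.
smartctl -H /dev/sda
# Full attribute table: watch reallocated and pending sector counts.
smartctl -A /dev/sda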

HA failure recovery for me is grabbing one of my spare PCs, burning a new HAOS image to the drive, then restoring a nightly snapshot from my NAS. Shouldn’t take more than an hour.

While 5,000 km away for a year, I had a system all set up and ready to go but powered down by a smart switch (one with a web interface, not just HA controlled). The plan was to remote in to my network, power it up, change the IP address and restore the latest snapshot. However, while I was away, support for supervised installs on Ubuntu Server was dropped. It probably would still have worked if I’d had to use it, but these unforeseen things happen. I had to use supervised as HAOS did not have support for the NIC in that PC.

I do the same thing when I head out in my RV for a trip: I have an HA backup on standby with a Wi-Fi outlet switch that I can fire up if needed.

For me it’s separating all critical systems (HA, DB, InfluxDB, MQTT, etc.) onto separate machines. This way, if my HA system dies, my historical data won’t be affected.

Plus, having both a local and remote backup store should be required. For local, I use UrBackup hosted on my Unraid NAS. All my systems do a full image backup (complete drive) every 5 days and then incremental file/image backups every 4 hours. Then, I use Crashplan Pro to store all my image backups remotely and those get uploaded on every image change.

Many approaches, huh? I guess that reflects the many installation types.

The script to stop, copy and start the DB is a good one. @CO_4X4, any chance you might share it or share where you got it from?

I now perform a full HA backup (config and DB) every time I update it (via script); my history DB is only about a gig, so it only takes a few seconds. Other backups use rsync to another physical drive, and I am working on a remote backup solution using Duplicati and Backblaze.
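
The update-time part boils down to something like this sketch (the snapshot name is arbitrary, and on newer CLI versions the ha snapshots command has been renamed to ha backups):

#!/bin/bash
# Take a full snapshot (config + DB) before touching anything, then update.
ha snapshots new --name "pre-update-$(date +%Y%m%d)"
ha core update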

I use Netdata for real-time monitoring of the system; using it, I was recently able to diagnose some extremely odd performance regressions, which turned out to be fragmentation of a large BTRFS volume.

My system is multi-use and has a lot of stuff going on, including multiple database servers, RAID controllers, NFS, SMB, DNS, etc., so there is a lot to monitor and a lot of configuration files to back up.

My drives are on a hardware RAID controller which will sound an audible alarm if there is a drive failure. It also reports any issue with the disks, of any kind, in a browser interface. In fact, I saw my first bad block on a disk in more than 6 years just a few weeks ago, so it may be time to do a full array clone to a new system.

You could consider Proxmox with a VM for HA, which you can back up 100% (as much as I hate to recommend Proxmox).
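
If you go that route, backing up the whole VM is a single command, and it can be scheduled in the Proxmox UI (the VM ID and storage name below are examples):

# Snapshot-mode backup of VM 100 to a storage called "nas", zstd-compressed.
vzdump 100 --mode snapshot --storage nas --compress zstd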

Sure. I just have a simple shell script in /config/backupdb.sh:

#!/bin/bash
# Stop HA so the SQLite database is quiescent, take a clean copy, restart.
ha core stop
cp /config/home-assistant_v2.db /config/home-assistant_v2.bak
ha core start
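
It runs from cron in the early hours, along these lines (the time is arbitrary, and it assumes the ha CLI is available wherever cron runs):

# Run the DB backup script at 03:00 every day.
0 3 * * * /config/backupdb.sh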

Consider configuring Recorder to only include the data you really need, and purge after one or two weeks max.
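
For example, a minimal recorder block in configuration.yaml along those lines (the domains listed are just examples of what you might choose to keep):

recorder:
  purge_keep_days: 14
  include:
    domains:
      - sensor
      - climate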

Just use InfluxDB / Grafana and you won’t lose the data. That way you can store years of data.
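
The HA side is a small block in configuration.yaml pointing at the InfluxDB instance (the host and database name are examples, using the InfluxDB 1.x style of the integration):

influxdb:
  host: 192.168.1.50
  port: 8086
  database: home_assistant
  include:
    domains:
      - sensor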

Thanks @mcarty, good advice. My installation (like many) was supervised, so I would likely have installed InfluxDB/Grafana on the same disk that failed, hence losing that too.

This discussion has been great, but it has got me wondering about the DIY approach vs. the Apple approach vs. the global reach of this awesome software.

Given so many ways to install, manage, back up and report, the market reach is diluted to those of us ‘enthusiasts’ who 1. have enough knowledge and 2. have enough time to learn and build our custom solutions.

I wonder if there is space for a controlled ecosystem, like the Apple ethos, for the non-enthusiast. There is already HA Blue, so that is a pre-built hardware/software environment, but what if Nabu Casa also took care of many other features like full system backup, InfluxDB and Grafana reporting, etc.? The kind of service whereby everyone, independent of geekiness, could have HA in their home just as they do Alexa or HomeKit. Obviously the IT-literate among us could build and run as we see fit, as we do now, but for many others a box-drop experience would be a very attractive proposition. Especially with Matter now coming to make component commissioning easy, and if ESPHome does get Bluetooth commissioning (thanks Jessie), then adding hardware should be a walk in the park.

I love HA, but if I think of 100 random people I know, the technical know-how of even how to install HA is out of their reach, let alone YAML configs and backup scripts etc.

Think of Android and Apple phones. One is infinitely hackable while the other is impossibly hackable (while still performing perfectly), but there is a market for both and, thanks to that, smartphones have saturated the mobile phone space.

I think HA could play in both spaces. It just needs some thought put into the controlled environment and some 99.999% reliable hosting support. Maybe Paulus and the team are onto that already :wink: