Backing up and restoring - the very much imperfect experience

Tim_Stubbs · May 12, 2023, 12:05pm

Preface: I completely love HA, I’m running RPI4 4GB with an SSD, Zwave, Zigbee sticks, HASSOS

I wanted to share my somewhat horrific experience with backups - it’s not been fun. I’ve used backups (snapshots) for some time along with the Google Drive add-on, which made me feel secure. My experience of late has made me feel the opposite.

So my SSD seemed to be dying and I needed to swap it out - for any other OS I’d just image the drive. Of course - Linux - so janky tools mean this becomes not so fun. Clonezilla (as I found out) doesn’t like resizing partitions down (even if they aren’t full and would otherwise fit). I can’t stress this enough - if you want to image via this method you are wrong - go instead for GParted and just copy partitions. It’s the only thing that really works. Don’t be fooled into thinking GParted resize + Clonezilla will work - you will waste hours.

Why do this? Well my SSD is dying and I only had a larger drive waiting around. So I move to that and of course end up with a very large data partition which now won’t fit onto the smaller replacement (new) SSD when it arrives.

But I’m getting ahead of myself - at this point you’re all wondering why I don’t use the RPI imager and then just restore a snapshot (backup) - well of course I did.

But it doesn’t go well.

In theory, backups restore everything - just image with the RPI imager, log on and restore a full backup. Same hardware, same SW and same config - should just work. Should.

So what went wrong (some samples)?

Restoring backups hangs. This happened a lot. Hours pass, things get stuck - you’re spending ages SSH-ing to see if things are progressing or even alive. There’s no progress indicator, no real heartbeat to monitor. Stress increases, relationships break down, hair is lost. Please improve this. Even via the CLI you just sit looking at some spinning ASCII - it doesn’t mean it’s doing anything. I gave up a few times. I couldn’t tell if things worked. I ended up using the CLI as I stopped trusting the frontend.
I found at one point it was a good idea to wax my DB - i.e. SSH in, stop core and nuke the DB. This seemed to improve stability and restores worked a bit better.
I’m using Zwave JS UI but none of my Zwave devices restore. They’re unreachable. I spend ages trying to figure out why and eventually notice my Zwave JS integration is busted. Why? Because for some reason I need to remove and re-add it to put the Zwave JS UI URL back in during initial setup. One problem solved.
All my HACS frontend stuff is missing and not loaded. Fix is to redownload them one by one. Why? No idea.
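
A minimal sketch of the kind of heartbeat described above, in Python. It samples the total size of a directory and flags a possible stall when nothing has changed for a few samples in a row. The function names, the window of 3 samples and the idea of watching a directory at all are assumptions for illustration, not anything HA or the Supervisor provides; which path a restore actually writes to depends on the install and is not established in this thread.

```python
import os
import time


def total_bytes(root):
    """Sum the sizes of all files under root, skipping files that vanish mid-scan."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file was removed or is unreadable; ignore it
    return total


def is_stalled(samples, window=3):
    """Return True if the last `window` samples are identical (no change seen)."""
    if len(samples) < window:
        return False
    return len(set(samples[-window:])) == 1


def watch(root, interval=30, window=3, rounds=10):
    """Print one heartbeat line per sample. `root` is a placeholder path."""
    samples = []
    for _ in range(rounds):
        samples.append(total_bytes(root))
        state = "possibly stalled" if is_stalled(samples, window) else "changing or too early to tell"
        print(f"{time.strftime('%H:%M:%S')} {samples[-1]} bytes - {state}")
        time.sleep(interval)

# Usage (hypothetical path, adjust to wherever the restore writes):
#   watch("/path/to/restore/target", interval=30)
```

An unchanged byte count doesn't prove a hang (a restore could be busy unpacking in memory or elsewhere), so treat "possibly stalled" as a hint to go and look, not a verdict.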

So you’re gonna say: well you’re using community stuff etc etc. Fact is the process is not pretty and whilst I see the backups as being essential I still feel the need to image the drive. I can restore this (now I know how to do this) with drive imaging - perfectly. It works right away. Nothing needs to be fixed. This is how backup restoration should work.

HA / HASSOS would be strongly augmented by a drive imaging solution. Even better if this could push the images to a network share - this is essentially how everything else in the house works and it’s super reliable and super fast at getting me back to a running state compared to everything else. Even in my now working state a restore takes 30mins+ and then I’ve got n mins/hours of figuring out what’s broken to contend with.

BTW if you know of a reliable Windows imaging tool that works for HASSOS backups please let me know. I’ve not tested Macrium or Acronis yet but I’ve moved to Macrium for all my Windows systems (as Acronis is now sadly bloatware). By this I mean I unplug the drive and stick it on a Windows system to back it up.

Sorry if this came across as a whinge but my experience has been that posting stuff like this has helped me find solutions to my problems, so in some part it may help someone else. I am up and running now despite dreading upgrading to the latest core version as my next step. I’m on 2023.4.6 and about to try 2023.5.2 today. Wish me well…

Rofo · May 12, 2023, 12:10pm

Interesting. I’ve had a very different experience recently.

When 2023.5.0 came out, it created instability in my ZHA network, so I tried various fixes and have had cause to roll back to 2023.4.6 two or three times during my fiddling around.

I also use the google drive automated backup addon, and for me, every time I restore from a backup the process has worked flawlessly, and usually takes around 10-20 minutes.

I’m running on a VirtualBox VM with a Windows 11 host, with an i5-7500T processor underneath, dedicating one of its cores to the VM. Physical storage is SATA SSD based.

I also have lots of HACS stuff and add-ons, and all restored fine every time.

Neil_Brownlee · May 12, 2023, 12:15pm

Like the previous poster - I’ve had no issues and I have a huge Influx DB as well to deal with so I tend to look at 30 odd minutes for a complete restore.

Last time I had to move from SSD to SSD I used balenaEtcher. However next time I will just install a blank and apply my backup (I did this once before but on SD card).

Good luck with the update - personally I would wait for 2023.5.3 as there are some oddities with Riemann sum integrals at the moment.

Tim_Stubbs · May 12, 2023, 12:24pm

To qualify - for some years I’ve been OK with snapshots and had no doubts about rolling back. The problems I encountered weren’t unique to me though - for the HACS one, for example, I found a post which helped me, and the same goes for the Zwave JS issue.

Rofo · May 12, 2023, 12:25pm

2023.5.x has defo screwed up some ZHA devices.

Looks like a fix is incoming for 2023.5.3:-

github.com/zigpy/bellows

HUSBZB-1 config regression fix for Aqara devices

zigpy:dev ← puddly:puddly/husbzb1-config-fixes

opened 05:10AM - 12 May 23 UTC

puddly

+97 -24

#550 introduced a major regression for older coordinators. Below are the previous EZSP configurations:

```python
v4 set CONFIG_ADDRESS_TABLE_SIZE = 16
v4 set CONFIG_APPLICATION_ZDO_FLAGS = 3
v4 set CONFIG_END_DEVICE_POLL_TIMEOUT = 60
v4 set CONFIG_END_DEVICE_POLL_TIMEOUT_SHIFT = 8
v4 set CONFIG_INDIRECT_TRANSMISSION_TIMEOUT = 7680
v4 set CONFIG_KEY_TABLE_SIZE = 4
v4 set CONFIG_MAX_END_DEVICE_CHILDREN = 32
v4 set CONFIG_MULTICAST_TABLE_SIZE = 16
v4 set CONFIG_PACKET_BUFFER_COUNT = 255
v4 set CONFIG_PAN_ID_CONFLICT_REPORT_THRESHOLD = 2
v4 set CONFIG_SECURITY_LEVEL = 5
v4 set CONFIG_SOURCE_ROUTE_TABLE_SIZE = 16
v4 set CONFIG_STACK_PROFILE = 2
v4 set CONFIG_SUPPORTED_NETWORKS = 1
v4 set CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE = 2
v7 changed CONFIG_END_DEVICE_POLL_TIMEOUT = 8 (was 60)
v7 changed CONFIG_KEY_TABLE_SIZE = 12 (was 4)
v8 set CONFIG_TC_REJOINS_USING_WELL_KNOWN_KEY_TIMEOUT_S = 90
```

EZSP v7 renamed `CONFIG_END_DEVICE_POLL_TIMEOUT_SHIFT` to `CONFIG_END_DEVICE_POLL_TIMEOUT`. The earlier PR missed this and the end result is that child devices that poll very infrequently (i.e. Aqara) will be aged out quickly by the coordinator.

---

https://github.com/home-assistant/core/issues/92581
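
To see why the v4 pair (60 with a shift of 8) and the single v7 value (8) in the PR description can describe the same poll timeout, here is a small arithmetic sketch in Python. The three numbers come from the configuration dump; the unit interpretation does not. It assumes the legacy timeout is `timeout << shift` seconds and that from v7 the value is an exponent meaning 2^N minutes. Neither is stated in the PR text, so treat both as assumptions. The helper names are made up for illustration.

```python
# Values taken from the EZSP configuration dump in the PR description.
V4_POLL_TIMEOUT = 60        # v4 CONFIG_END_DEVICE_POLL_TIMEOUT
V4_POLL_TIMEOUT_SHIFT = 8   # v4 CONFIG_END_DEVICE_POLL_TIMEOUT_SHIFT
V7_POLL_TIMEOUT = 8         # v7 CONFIG_END_DEVICE_POLL_TIMEOUT


def legacy_timeout_seconds(timeout, shift):
    # ASSUMPTION: older firmware computes the timeout as timeout << shift, in seconds.
    return timeout << shift


def v7_timeout_seconds(exponent):
    # ASSUMPTION: from v7 the value is an exponent, i.e. 2**exponent minutes.
    return (2 ** exponent) * 60


legacy = legacy_timeout_seconds(V4_POLL_TIMEOUT, V4_POLL_TIMEOUT_SHIFT)
modern = v7_timeout_seconds(V7_POLL_TIMEOUT)
print(legacy, modern)  # 15360 15360
```

Under those assumptions both settings come out at 15360 seconds (256 minutes), which would explain why sleepy devices get aged out if the renamed setting is missed or applied with the wrong meaning.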

giqcass · June 8, 2023, 4:34am

Restore was a massive mess for me! I’m considering switching to Docker.

Tim_Stubbs · June 18, 2023, 11:55am

I’m having similar thoughts - a hosted solution would allow for true backups and zero downtime in the event of problems.

dominikandreas · July 23, 2023, 7:13am

@Tim_Stubbs you’re speaking from my heart. I just migrated from a Raspberry Pi to Home Assistant Yellow and the experience was absolutely dreadful. The backup system is great when it works, but when things go wrong (which they always do at some point), there’s no indication of why. In the frontend you just have a spinner which keeps spinning regardless of problems and there’s no way to inspect the actual restore process.

For me there must be something broken: whenever I try to restore a backup, supervisor seems to do something and then just hangs. No error messages, nothing in the docker logs. Even as a devops guy with lots of Linux and Docker experience, who has been with Home Assistant from the start, I find it seemingly impossible to even find out the reason, let alone fix or work around it. This is a serious issue and needs to be addressed. It would probably be a good idea to create an issue in the GitHub repo to raise awareness among the developers. Reliable backup and restore is just too important for a system like Home Assistant.

AlfredoSola · October 6, 2023, 2:30pm

Count me in. Home Assistant on venv working fine on a rPi3. Backup (a minute or two), download (had to use scp, as the app wouldn’t download the file or show an error) and then, on restore to a fresh Yellow, half an hour (and counting) of spinning icon.
I think this needs to improve, especially for non-technical users. Me, I can configure the whole thing from scratch, no problem. But someone who is not an engineer and just needs to recover their HA will be utterly lost and stressed by the experience, as no option is ever offered.

Tim_Stubbs · October 16, 2024, 9:20am

FWIW in the end I migrated off my Pi and onto an N100 box. I took the opportunity to host via Proxmox and now have automated snapshots running of HA - OS and all. Upgrades no longer worry me - failure is a few minutes to revert.

I’m still using backups and the Google Drive add-on - it’s just that now I have the surety of a working (and portable) system in the event of any sort of failure.