So I rebooted my HAOS instance (hosted on Proxmox) today and it didn’t boot - it sat at the EFI menu.
After some investigation, it appears that I have no partition table. More specifically, I can see from my Proxmox backups that it was present and correct on the 19th March, but not present since.
If I boot my HAOS to a Linux rescue disk and inspect the first few bytes of the disk device I see this: -
It turns out that the first 512 bytes of my disk are, in fact, no longer a partition table but some CSS (specifically, it looks like Node-RED code!)
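For anyone wanting to run the same check from a rescue disk, here is a minimal sketch of how sector 0 could be classified. It assumes 512-byte logical sectors, and the function names are made up for illustration. It relies only on two well-known facts: a valid MBR (including the protective MBR on a GPT disk) ends with the bytes 0x55 0xAA, and a protective MBR has partition type 0xEE in its first entry.

```python
import string

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def describe_sector0(sector: bytes) -> str:
    """Give a rough classification of a disk's first sector."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least 512 bytes")
    sector = sector[:SECTOR_SIZE]

    # A valid MBR ends with the boot signature 0x55 0xAA at offsets 510-511.
    if sector[510:512] == b"\x55\xaa":
        # The first partition entry starts at offset 446; its type byte
        # sits 4 bytes in, at offset 450. Type 0xEE marks a protective MBR.
        if sector[450] == 0xEE:
            return "protective MBR (GPT disk)"
        return "MBR"

    # No signature: report whether the sector looks like text (e.g. CSS).
    printable = set(string.printable.encode("ascii"))
    ratio = sum(b in printable for b in sector) / SECTOR_SIZE
    if ratio > 0.9:
        return "no boot signature; mostly printable text"
    return "no boot signature; binary data"


def read_sector0(path: str) -> bytes:
    """Read the first sector of a device or image file (read-only)."""
    with open(path, "rb") as f:
        return f.read(SECTOR_SIZE)


# Example (the device path is hypothetical and needs root to read):
# print(describe_sector0(read_sector0("/dev/sda")))
```

A sector full of CSS would come back as "no boot signature; mostly printable text", which matches what I saw.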
I know I updated Node-RED a couple of days ago, so does anyone know why or how updating components within HAOS would seemingly write directly to the disk device like this? It seems like a horrible bug if that’s what’s happened.
It’s possible this was caused by some block device weirdness in my Proxmox virtual machine setup. I’m asking about that separately over there, but obviously I’m investigating this option too.
It did also happen a couple of months ago - that time I had to restore from a backup taken a few days earlier and lost a lot of config changes. This time I’m going to see if I can restore just the 512 bytes of known-good partition table (which rarely changes, of course) to see if I can get it to boot - but of course, depending on the root cause of the issue, there could be other corruption elsewhere…
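The "restore just 512 bytes" idea can be sketched roughly like this. It is only an illustration under assumptions: the function name and paths are hypothetical, sectors are assumed to be 512 bytes, and the backup is assumed to be available as a raw disk image. It refuses to write unless the backup's sector 0 carries the 0x55AA boot signature, and it returns the overwritten bytes so they can be kept for later inspection.

```python
import os

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def restore_sector0(backup_path: str, target_path: str) -> bytes:
    """Copy sector 0 from a known-good raw image onto the target.

    Returns the 512 bytes that were overwritten on the target.
    """
    with open(backup_path, "rb") as src:
        good = src.read(SECTOR_SIZE)

    # Sanity check: never write a sector that is not a plausible MBR.
    if len(good) != SECTOR_SIZE or good[510:512] != b"\x55\xaa":
        raise ValueError("backup sector 0 lacks the 0x55AA boot signature")

    # "r+b" updates in place without truncating the target.
    with open(target_path, "r+b") as dst:
        damaged = dst.read(SECTOR_SIZE)  # keep what is about to be replaced
        dst.seek(0)
        dst.write(good)
        dst.flush()
        os.fsync(dst.fileno())  # push the write through to the device
    return damaged


# Example (both paths are hypothetical; run from a rescue environment):
# old = restore_sector0("/mnt/backup/haos-disk.raw", "/dev/sda")
# with open("/root/damaged-sector0.bin", "wb") as f:
#     f.write(old)
```

This only touches sector 0. On a GPT disk the primary GPT header starts in the next sector, so if the damage extends past the first 512 bytes this alone would not be enough.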
I would call it a random disk error, watch the disk SMART data for continued errors, restore from a pre-error backup, and be prepared with a replacement disk.
Unfortunately, just because it’s apparently Node-RED data doesn’t mean HA ‘did’ it. There are thousands of reasons that particular data could be in that particular sector, and all of them point to something bad happening at the disk level, with root causes going all the way back to and (potentially) including sunspots (yes, literally). Without good logging of your disk metrics and controller data, I doubt very seriously you’ll find the actual event… Something really bad happened to overwrite the boot sector, and I’m really thinking controller error. So watch for further indicators of disk issues using whatever tools you have for watching disks on that hypervisor (now that you know it’s something to watch), and keep very good, current backups.
Yes, I know you very much want to know why - sometimes an event just leads you to have better instrumentation for the next occurrence. So how do you plan to monitor disk events in Proxmox?
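As one small piece of instrumentation (not a substitute for SMART or controller monitoring), a known-good fingerprint of sector 0 could be recorded and compared on a schedule, so an overwrite is noticed before the next reboot. This is just a sketch; the function names and the example path are hypothetical, and 512-byte sectors are assumed.

```python
import hashlib

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def sector0_fingerprint(path: str) -> str:
    """Return the SHA-256 hex digest of the first sector (read-only)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(SECTOR_SIZE)).hexdigest()


def sector0_changed(path: str, known_good: str) -> bool:
    """True if sector 0 no longer matches the recorded fingerprint."""
    return sector0_fingerprint(path) != known_good


# Example (hypothetical path to the guest's disk as seen from the host):
# baseline = sector0_fingerprint("/dev/pve/vm-100-disk-0")  # record once
# ... later, e.g. from a scheduled job ...
# if sector0_changed("/dev/pve/vm-100-disk-0", baseline):
#     print("sector 0 changed - check the partition table")
```

Since the partition table rarely changes, any mismatch is worth a look, though a legitimate repartition would also trigger it.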
So it appears to have been caused by a known bug in Proxmox 7 that allows sector 0 (partition table) to be overwritten during reset.
Thankfully I recovered my guest by restoring the partition table from an earlier, good backup - and I’m now pretty confident there shouldn’t be corruption elsewhere.