HAOS completely hosed after 12.4 update with VMware

madbrain · June 19, 2024, 1:49am

I’m using VMware Workstation Pro 17 on Windows 10.
I updated to 12.4 when I got the notification while in my hot tub. HAOS never came back. I just came down to my home office, and the guest screen is completely black. I rebooted the VM, and the same thing happened: nothing on screen at all.

Fortunately, I do have backups. But it seems it’s going to be a bit of a pain, as I have to create a new virtual disk before I can restore the backup.

tom_l · June 19, 2024, 1:54am

Please report this here: Issues · home-assistant/operating-system · GitHub

madbrain · June 19, 2024, 2:01am

Looks like VMware itself got hosed. It couldn’t power off the VM.
I had to reboot the Windows host. I manually restarted the VM, and it started coming back up. I now see the Home Assistant banner on the guest screen, and the web GUI finally came up. Not sure what HAOS could have done to cause this.

I’ll report this on GitHub, but it seems it needs to go to VMware too.

madbrain · June 19, 2024, 2:24am

Unfortunately, I’m unable to open a support case with VMware. I get this obscure error:

https://knowledge.broadcom.com/external/article/211873/invalid-token-error-in-case-management.html

It looks like I need a Broadcom site ID. But since I’m not a paying Broadcom customer, it appears I’m not able to open a case, even to report a bug.

LeMoonStar · June 19, 2024, 7:57am

The same problem occurs for me on a TrueNAS-SCALE-23.10.2 VM. Home Assistant OS starts, and I am able to access it through the virtual display and see a shell. Accessing the web interface also works, but after a few seconds it seems to lock up.
After this lockup, the web interface is inaccessible and the shell is unresponsive.
I do have a snapshot from last night, so I’ll try to roll back.

LeMoonStar · June 19, 2024, 8:06am

Interestingly, even after rolling back to 12.3 via the ZFS snapshot of the VM’s dataset, the system locks up after a few seconds. Checking the shell output, the rollback was successful, and I was able to go to the web UI and click “skip” on the update to 12.4.
So it does seem to be a different problem after all?

madbrain · June 19, 2024, 9:25am

Definitely a different problem. For one thing, my host is Windows, and it’s not running the Windows port of OpenZFS.

I was accessing the VM from the physical display of the Windows machine, hooked to a KVM switch.
The VMware window showed a completely black guest screen. I then tried to restart the VM, and I thought it had restarted, but actually nothing had happened. I tried to power the VM down, but that failed. The only option I had was to reboot the host. HAOS has been fine since then.

LeMoonStar · June 19, 2024, 9:34am

At first, I assumed the issue was caused by the update. Only later did I find that it doesn’t seem to be the same issue.
I have, however, been able to fix my issue by re-creating the VM: same drive, same everything. Dunno why that works, but it does.

I’ve read lots of people reporting that as a solution to this kinda problem, with all kinds of different VM setups. No matter whether the host is Linux or Windows, or the hypervisor is VirtualBox, VMware, etc., the problem seems to be universal.

madbrain · June 19, 2024, 9:58am

It could be that all hypervisors have some bugs. I ran on VirtualBox under the same Windows box for a while. I kept getting database corruption over time. It would creep into the nightly tar backups, until eventually the purge would clear the DB at 4:11am. I could go back to a backup from a few days before, but it would just happen again soon after.
I switched to VMware (first Player, now Workstation Pro) and haven’t had any database issues since. I would much prefer to still be using VirtualBox, as it is open-source.
I don’t believe I had these DB problems when I used to run HAOS on a Raspberry Pi (3B+ and 4). However, it was much too slow for my taste, especially the network & disk I/O when backing up and restoring. We are talking nearly an hour start to finish, whereas it takes barely a couple of minutes on the Windows host with striped NVMe SSDs. It takes more time to restart than to restore.
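
For anyone wanting to check whether corruption has already crept into a backup before restoring it, here is a minimal sketch. It assumes the recorder is using the default SQLite database (normally `home-assistant_v2.db` in the config directory) and that you have already extracted that file from the backup tar. The function name `check_sqlite_integrity` is made up for illustration; it is not part of Home Assistant.

```python
import sqlite3
from pathlib import Path


def check_sqlite_integrity(db_path: str) -> list[str]:
    """Run SQLite's built-in integrity check on a database file.

    Returns ["ok"] when SQLite finds no problems, otherwise a list of
    messages describing what it found.
    """
    # Open read-only through a file: URI so the check cannot modify the copy.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        # Badly damaged files can fail before the check even runs.
        return [f"error: {exc}"]
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Assumed path to a database file extracted from a backup.
    print(check_sqlite_integrity("home-assistant_v2.db"))
```

Running this against the database from each nightly backup, newest first, should show roughly which backup was the last clean one. It only tells you about the SQLite file itself, not about the hypervisor or virtual disk underneath.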