For about 6 weeks I have had a strange issue with my HA OS, installed as a virtual machine in Proxmox, which had been running rock solid for 2 years.
It took me a while to understand what was going on, as I had several containers running on the same Proxmox host.
While troubleshooting I found that the HA virtual machine is the cause of my Proxmox crashes. If I don’t start HA, the server runs great without problems.
The problem is quite strange: when I start HA it can work fine for one or two days, then it simply dies, crashing Proxmox, and the only way I can recover is to power-cycle the server.
After some time I realized that there are two things in particular that make it crash 90% of the time:
Running a backup
Updating Core or a big integration
For example, updating Core to 2025.2.3 took me about 20 attempts.
I start the update and after 2 minutes HA becomes unavailable and Proxmox dies. I have to power-cycle the server and try again.
Smaller updates work most of the time; for example, if I update a simple integration it works. But if, after successfully updating an integration, I run another, bigger update, the system crashes, and after the power-cycle that same small update still has to be performed. If I reboot immediately after the successful small update, then the update really sticks.
Anyway, updates are not the only time it crashes; sometimes it simply dies without any interaction after a few hours or even a couple of days.
I disabled all integrations except DHCP Server and Zigbee2MQTT.
This unclear problem is driving me crazy, and I’m starting to suspect it has something to do with a hardware failure rather than with HA itself. I know I said it only happens with the HA machine, but it’s also the ‘heaviest’ thing I run (maybe with the exception of NextCloud), and it is the only virtual machine; the others are all containers.
What hardware is this running on? It sounds like a possible hardware issue (bad RAM?). Can you remove/swap the memory, or is it soldered onto the board? What does your System Monitor show for memory and processor usage over time? How large is your database/backup file?
Hello, an update: I restored the Proxmox backup onto another storage and it’s now working great again!
The strange thing is that the storage that was causing the fault was hosting all my other containers without any problem. Anyway, it’s fixed now.
Hello, I don’t have the issue of the Proxmox server itself going down, but everything else you describe is the same, except that it’s my VM that stops working and I have to start the guest again. It could also be a hardware problem, or more of a memory issue, but while it’s updating it’s not using much memory or CPU.
Your memory value is just a snapshot, and it shows only what is being used at that moment.
You cannot see what is used a second later, nor what is being requested.
The crash occurs when a request cannot be fulfilled. A request can be problematic if it asks for more than is available, but also if it asks for a contiguous block of memory larger than is available.
The second case is the hardest to deal with, and only an ample surplus of available memory can truly counter it.
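To make the contiguous-block point concrete, here is a toy sketch in Python. It is not how the Linux kernel or Proxmox actually allocates memory; the block list and the function names are made up for illustration only. It just shows that the total free amount can look fine while no single free stretch is big enough.

```python
def free_runs(blocks):
    """Return the length of every run of consecutive free blocks.

    `blocks` is a toy memory map: True means used, False means free.
    """
    runs = []
    current = 0
    for used in blocks:
        if used:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


def can_allocate(blocks, size):
    """True if some contiguous free run is at least `size` blocks long."""
    return any(run >= size for run in free_runs(blocks))


# Hypothetical fragmented memory: 6 blocks free in total,
# but never more than 2 of them next to each other.
memory = [True, False, False, True, False, False, True, False, False]

total_free = memory.count(False)
largest_run = max(free_runs(memory))

print(total_free, largest_run, can_allocate(memory, 4))  # prints: 6 2 False
```

A request for 4 blocks fails here even though 6 are free, which is why a usage graph alone will not reveal this kind of failure.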
I am having that issue too. HA keeps running, but the NIC the VM is using crashes to the point that the port doesn’t even show up if you run ‘ip link show’. This doesn’t always happen, so at times I can use my JetKVM to log in and run ifdown and ifup on the NIC used by HA. Anyhow, the logs show the e1000 hang error. So… I moved HA from the embedded 1GbE Intel NIC to an Intel X520-DA2 2-port SFP+ 10GbE NIC, but the same thing happens even though that card doesn’t rely on the e1000 driver. Frustratingly, I’ve had this issue before when running Proxmox 8 and had found some way to resolve it (turning off features on the NIC), but that fix doesn’t seem to work now (or I am missing something).
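For anyone who wants to catch the moment the port vanishes, here is a minimal Python sketch that checks whether the kernel still exposes an interface. It assumes a Linux host, where each interface appears as an entry under /sys/class/net. The interface name `enp1s0` is a placeholder, not taken from the setup above, and the helper names are hypothetical.

```python
import os


def interface_present(name, sysfs_root="/sys/class/net"):
    """Return True if an interface called `name` is listed under sysfs_root."""
    return os.path.isdir(os.path.join(sysfs_root, name))


def missing_interfaces(expected, sysfs_root="/sys/class/net"):
    """Return the names from `expected` that are not currently present."""
    return [n for n in expected if not interface_present(n, sysfs_root)]


# "enp1s0" is a placeholder; replace it with the NIC your HA bridge uses.
print(missing_interfaces(["lo", "enp1s0"]))
```

Run from cron or a systemd timer on the host, this could log a timestamp when the port disappears, which can then be lined up with the hang messages in the kernel log.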