HA regularly becomes unavailable

My HA instance regularly becomes unavailable, which is frustrating because it takes a core restart to get it back up and running. Whilst it is in this state, automations do not trigger and integrations such as Zigbee2MQTT (Z2M) do not work, which effectively cripples my house.

I’m trying to diagnose what is causing this, but I don’t really know where to start, as it usually happens in the middle of the night or whilst we are out. My theory is that the Synology integration is responsible. I use it to start and stop the NAS based on occupancy, and I suspect that when the NAS is turned off, the integration does something that leads to HA failing.

Where do I begin trying to diagnose this issue? Once HA crashes or becomes unresponsive, there is no way to access the logs, so where do I look? Are there any pointers for how to track down exactly when HA becomes unresponsive?

I also wonder if there’s a way to monitor HA externally and trigger a core restart if it crashes, although that would be a sticking plaster rather than a fix.
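As a rough sketch of that external watchdog idea, something like the following could run on a separate machine. It polls the HA REST API endpoint `/api/` with a long-lived access token and counts consecutive failures. The base URL, token, polling interval, threshold and the `restart_ha` hook are all placeholders (assumptions, not anything from my setup), and how the restart is actually triggered depends on the install.

```python
import http.client
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def ha_is_up(base_url, token, timeout=10):
    """Return True if GET <base_url>/api/ answers with HTTP 200.

    Any error (timeout, refused connection, non-2xx status such as a
    401 from a bad token) is treated as "not up".
    """
    request = urllib.request.Request(
        f"{base_url}/api/",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False


class Watchdog:
    """Counts consecutive failed checks and remembers when they started."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.first_failure = None

    def record(self, ok, now=None):
        """Feed one check result; return True exactly once, when the
        number of consecutive failures reaches the threshold."""
        if ok:
            self.failures = 0
            self.first_failure = None
            return False
        if self.failures == 0:
            self.first_failure = now or datetime.now(timezone.utc)
        self.failures += 1
        return self.failures == self.threshold


def run(base_url, token, restart_ha, interval=60, threshold=3):
    """Poll forever and call restart_ha(first_failure) when HA looks down.

    restart_ha is a hypothetical hook supplied by the caller.
    """
    watchdog = Watchdog(threshold)
    while True:
        if watchdog.record(ha_is_up(base_url, token)):
            print(f"HA unresponsive since about {watchdog.first_failure.isoformat()}")
            restart_ha(watchdog.first_failure)
        time.sleep(interval)
```

Requiring several failures in a row avoids restarting on a single slow response, and the logged first-failure time narrows down when HA stopped responding to within one polling interval.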

Any advice would be greatly appreciated.

I’m interested in this topic as well. I’ve had some random crashes that I’m investigating. I’ve been able to tell when the hub starts going crazy by looking at my graphs.

That lets me retrace my steps and work out approximately when it started having trouble, so I can dig deeper into that timeframe.
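The same idea can be applied to exported data instead of eyeballing a graph. Below is a minimal sketch that looks for unusually long gaps between consecutive readings of one sensor. The timestamps are made-up sample data, and the one-minute reporting interval and the tolerance factor are assumptions to adjust for the sensor in question.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=3):
    """Return (start, end) pairs where two consecutive samples are
    further apart than tolerance * expected_interval."""
    limit = expected_interval * tolerance
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > limit
    ]


# Made-up sample data: a sensor that normally reports once a minute.
samples = [
    datetime(2024, 1, 1, 7, 58),
    datetime(2024, 1, 1, 7, 59),
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 8, 1),
    datetime(2024, 1, 1, 8, 2),
    datetime(2024, 1, 1, 8, 31),
    datetime(2024, 1, 1, 8, 32),
]

for start, end in find_gaps(samples, timedelta(minutes=1)):
    print(f"No data between {start:%H:%M} and {end:%H:%M}")
```

The start of each gap is the last moment the sensor was known to be reporting, which gives a timeframe to search in the logs.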

Mmm, that’s a good point. However, what I’ve just seen makes this even stranger. There are no “flatlines” in my graphs, which suggests either that parts of HA were still operational whilst others weren’t, or that I restarted HA shortly after it became unavailable.

I can see all my Zigbee devices going offline at around 08:02, which is about the time I spotted that the UI was unresponsive and that automations (mostly for lighting and other Zigbee devices) were not working.
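A moment like that, where many devices drop out together, can also be picked out of exported state history. The sketch below groups transitions to the `unavailable` state that fall within a short window of each other. The entity names and timestamps are invented for illustration, and the two-minute window and minimum of three entities are arbitrary assumptions.

```python
from datetime import datetime, timedelta


def find_unavailable_bursts(changes, window=timedelta(minutes=2), min_entities=3):
    """Group 'unavailable' transitions that happen close together.

    changes is an iterable of (timestamp, entity_id, state) tuples.
    Returns a list of (burst_start, sorted_entity_ids) for every burst in
    which at least min_entities distinct entities became unavailable
    within `window` of the first one.
    """
    events = sorted(
        (timestamp, entity_id)
        for timestamp, entity_id, state in changes
        if state == "unavailable"
    )
    bursts = []
    i = 0
    while i < len(events):
        start = events[i][0]
        entities = set()
        j = i
        # Collect every event inside the window that opens at `start`.
        while j < len(events) and events[j][0] - start <= window:
            entities.add(events[j][1])
            j += 1
        if len(entities) >= min_entities:
            bursts.append((start, sorted(entities)))
        i = j
    return bursts


# Invented example data, not taken from a real instance.
changes = [
    (datetime(2024, 1, 1, 8, 1, 50), "light.kitchen", "on"),
    (datetime(2024, 1, 1, 8, 2, 5), "light.kitchen", "unavailable"),
    (datetime(2024, 1, 1, 8, 2, 10), "light.hall", "unavailable"),
    (datetime(2024, 1, 1, 8, 2, 40), "sensor.door", "unavailable"),
    (datetime(2024, 1, 1, 9, 15, 0), "sensor.door", "unavailable"),
]

for start, entities in find_unavailable_bursts(changes):
    print(f"{start:%H:%M:%S}: {len(entities)} entities went unavailable")
```

A single device dropping out on its own is ignored, so only events that hit several devices at once are reported.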

I am running my HA instance in a virtual machine on Proxmox 7. There were no obvious spikes in resource use in the past 24 hours, but I’ve had two crashes (or instances of the UI becoming unresponsive) in that time.