Diagnosing a recurring issue with HA becoming unresponsive

Hi,

I wonder if anyone could help me or give me some advice on diagnosing an issue with my installation please?

It’s been solid as a rock for three years, but in the last few weeks maybe 2-3 time a week it’s become inaccessible, often when I wake up in the morning from overnight. Cannot login via the web or app, it just sits there trying to load indefinitely.

Some automations seems to still be running as far as I can tell, but most are not.

It’s running as a VM on Proxmox on a little HP Elite desk 800 G2 mini server, which has plenty of resources generally (and an SSD) .

Restarting the VM or even the host! doesn’t seem to work and I literally have to power cycle the host server to get it back up.

As far as I can tell there’s nothing obvious in the logs and I don’t remember a particular change I made around the time it started happening that could be the cause. The only error I see sometimes in the logs around the time it occurs is something to do with my Bluetooth adapter - “bluetooth: hci0: malformed le event 0x02”, not sure if it’s relevant or related.

CPU and Memory usage seems fine in Proxmox as far as I can tell, but it does feel like something is getting stuck, tieing it up and slowing it down so much it’s unusable.

I’m concerned it’s getting worse and if I’m not home when it happens I have no way of recovering it until I am. So any help or advise on diagnosing the issue would be much appreciated!

Thanks.

See this topic:

1 Like

Thanks I have enabled debug mode and will see if that provides any clues.

However to make matters worse since I restarted it ZHA now refuses to start, just constantly says failed setup. Argh!

I hope I’m not inappropriately hijacking this thread, just trying to contribute to an existing issue instead of opening new threads for the same or at least very similar problem.

I’m also running into the issue of, after running reliably for more than a year, I am now randomly getting system hanging issues at random times, maybe once every week or two.

I don’t know if my issue is caused by an intergration or by what really, so I’m looking for ideas on how to go about narrowing down where the issue may be coming from.

Today this happened again and I was able to capture this history graph which makes the symptom pretty evident:

Apparently HA rebooted itself after about 30 minutes of being unresponsive.

In my case I’m running the latest (as of now) HA on HA OS. It’s running on a Raspberry pi 3 with an SSD. Normally very responsive and stable, until this happens.

I’ll enable debug mode and see if that says anything.

The previous time this happened, I was able to SSH in and could notice clearly that a lot of CPU was being spent in IO wait, but I couldn’t narrow down from there to which process was causing all that IO activity. Any hints in this direction for the next time around would be also useful.