HAOS on Pi3 hanging every few days, requiring hard power reset. How to troubleshoot?

HA has been hanging on me every few days, to the point where it no longer even responds to a ping. I have to hard power cycle it to bring it back up. This has been happening for about 2-3 months, after what was a stable system for a year. Given that logs don’t persist across a reboot, how can I diagnose what is causing it to hang? I can’t find a way to stream syslog externally with HAOS.

I am running:
core-2021.5.5
supervisor-2021.04.3
HAOS 5.13
Raspberry Pi 3 with external bootable SSD

I migrated to zwave2mqtt around this time, but I’m not at the point that I can attribute the crashes to that.

I am only using 8% of disk space as per the host disk space status bar, so I am not running out of disk. CPU utilization <3%. RAM 30% on core and 6% on the OS.

How can I get more information to diagnose what is happening here? HA is practically useless right now due to the unreliability. I tried to get some assistance in Discord a month ago but didn’t get much help on where to start diagnosing.

To get started with diagnosis, I have disabled Node Red for now to see if the hang is related to any of my automations. I didn’t make significant changes in there when the issue started, but I am going to try to remove it as a variable due to the amount of heavy lifting done inside of it.

These are the addons I am running:

Well, it looks like the hanging of the server is back even with all of my Node Red workflows disabled, so I doubt it is a Node Red issue.

I need some help here on tools to diagnose what is occurring.

How can I stream events and diagnose the source of this issue, when it becomes unresponsive to SSH at the point the issue occurs? Note: I do notice HA will be REALLY slow once in a while before the issue triggers.
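Since nothing on the box itself survives the hang, one option is to watch it from a second machine. Below is a minimal sketch of that idea, not an HAOS feature: it pings the host at a fixed interval and appends a timestamped OK/SLOW/DOWN line to a local file, so the slowdown before a hang and the moment of the hang are both recorded. The hostname `homeassistant.local`, the log file name, the 0.5 s "slow" threshold, and the Linux `ping -c/-W` flags are all assumptions to adjust.

```python
import subprocess
import time
from datetime import datetime
from typing import Callable, Optional


def ping_once(host: str, timeout_s: int = 2) -> Optional[float]:
    """Send one ICMP ping; return elapsed seconds, or None if there was no reply.

    Assumes a Linux-style `ping` (-c count, -W timeout in seconds).
    The elapsed time includes process start-up, so it is only approximate.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return time.monotonic() - start


def classify(rtt: Optional[float], slow_s: float = 0.5) -> str:
    """Map a round-trip time to a coarse state for the log."""
    if rtt is None:
        return "DOWN"
    return "SLOW" if rtt >= slow_s else "OK"


def watch(
    host: str,
    write: Callable[[str], object],
    probe: Callable[[str], Optional[float]] = ping_once,
    interval_s: float = 30.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], object] = time.sleep,
) -> None:
    """Probe `host` repeatedly and pass one log line per check to `write`."""
    checks = 0
    while max_checks is None or checks < max_checks:
        rtt = probe(host)
        stamp = datetime.now().isoformat(timespec="seconds")
        write(f"{stamp} {host} {classify(rtt)} {rtt}\n")
        checks += 1
        sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical hostname and log file; line-buffered so entries hit disk promptly.
    with open("ha_watch.log", "a", buffering=1) as log:
        watch("homeassistant.local", log.write)
```

This only tells you when the host slowed down or dropped off the network, not why, but the timestamps can be lined up with whatever the console shows.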

I plugged it into a second monitor on my desk and watched it through the workday. It started to spit some errors out of the console.

Specifically a block error along the lines of:
blk_update_request: i/o error, dev sda sector ...

and then a bunch of:
lan78xx 1-1.1.1.1:1.0 eth0: kevent 0 may have been dropped

I decided to replace the pi3 with a pi4, as some comments I saw were about running out of the 1GB of memory. Another possibility is that the storage is failing, but it is a Kingston SSD that is only 1 year old, so I am not sure it’s that HW just yet. A reformat of the drive as part of the install may have helped.

So far it has been faster, and stable. Let’s see…

I’m having the same problem. I may have to do as you did and move to a rpi 4, but I’d like to diagnose it first. Usually, I have to pull power, and then the logs are gone. I’ve googled around, but it’s all a bit confusing. I increased the max_entries on the logs; will that work? Or will the logs still be gone after a restart? Is there a way I can retain the logs? I’m using core-2021.6.5, supervisor-2021.06.3, and Home Assistant OS 6.0, on a RPI 3.

I couldn’t ever find any way to get logs to persist. It is very odd for the OS not to allow pushing logs externally, at least via syslog to a server.

I’ve been stable now though for over a week, which is probably 3x any previous failure interval, after the HW change, SSD reformat, and power supply change.

If you want to know the problem, try hooking the console up to a monitor and either keep a close eye on it or set up something physical to take a photo of the monitor every couple of minutes (for me, the console would start to print errors many minutes before going unresponsive).
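If you can get the console output as text rather than photos (for example by typing up what you see, or from a serial console, which is an assumption and not something tried in this thread), a small script can count the two kinds of errors quoted earlier. This is only a sketch: the patterns are built from the two messages seen on my console, and the category names are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Patterns based on the two console messages quoted in this thread.
# The category names are hypothetical labels, not kernel terminology.
PATTERNS = {
    "disk_io_error": re.compile(r"blk_update_request: i/o error, dev \w+", re.IGNORECASE),
    "eth_kevent_dropped": re.compile(r"lan78xx .* eth0: kevent \d+ may have been dropped"),
}


def summarize(lines: Iterable[str]) -> Counter:
    """Count how many lines match each known error pattern."""
    counts: Counter = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # "console.txt" is a placeholder for wherever the captured text was saved.
    with open("console.txt", encoding="utf-8", errors="replace") as f:
        for name, count in summarize(f).items():
            print(f"{name}: {count}")
```

A capture dominated by disk I/O errors points toward the storage, cable, or power side, while one with only network messages points elsewhere, so the split is a rough hint on what to swap first.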

You could try reformatting and recovering from backup, and if it persists, swap to another high quality power supply first.

Thanks. I unplugged it and started up a Proxmox box that I wasn’t using, but had planned to move HA and other things to eventually. It took about an hour to move HA to that. Weirdly, I had to reload the HA config from the command line. The web interface didn’t work.

Just an update that I’ve been stable since this swap, on the same SSD, for about 3 months now. So it was either a filesystem-level problem fixed by the drive format (not confident on that, since a fresh install/restore did not resolve the issue on the pi3), or the pi4 is handling the load better and preventing the issue. I am going to assume I was running out of memory. For context on scaling, I have 347 entities, 234 of them of type sensor. About 30 zwave nodes and maybe 20 zigbee.

Having the same issue. I had not read this thread before, so I opened the topic “HA not reachable - how to troubleshoot?”.