How to debug HassOS going offline overnight?

Hi everyone.

I have a standard HassOS on Raspberry Pi 4 installation which seems to be going offline overnight. When I check it in the morning, it’s not pingable, and the only way to get access to the system is to power-cycle the Raspberry (which I’m aware is not good).

My question is, how should I debug this? I’ve checked all the logs in the supervisor tab, but they only seem to log up to the point of the last reset. Is there any way of checking what happened before the Raspberry was turned off, like a persistent log file?

Thanks in advance.

No, unfortunately the logs are not persistent across restarts.

Before restarting, you can put the SD card in a Linux PC to read the logs, or a Windows PC running DiskInternals Linux Reader (free).

I see, I will try that. Thank you!

I had to wait for HA to break in the same way (it worked fine for a few days) to try this. Last night it finally did, so I tried this approach.

The home-assistant.log file shows nothing of interest. I couldn’t read the host logs - I found a bunch of .journal files, but they seem to be binary. They cannot be read without journalctl, am I right?

One other thing has been happening: at random times while using the UI, I would get a “read-only file system” error, which seems to be a symptom of a dying microSD. That might be to blame for the overnight crashes as well. I’m going to replace the card (this one was a random one I had lying around, and is probably not well suited to the job) and see if that fixes the issue.

Thank you for your help!

Check this thread: HA stops work every Monday

I believe the problem is a corrupted microSD card, and that HA checks it every Monday night or something like that.

Thank you! Yes, I would say that the card has something to do with it. I will order a better card and, in the meantime, check whether the crashes are exactly one week apart.
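A rough sketch of that weekly check, assuming the crash times are noted down by hand. The function names, the two-hour tolerance, and the timestamps are made up for illustration and are not taken from any real system:

```python
from datetime import datetime, timedelta

def crash_gaps(crash_times):
    """Return the time between consecutive crashes, oldest first."""
    ordered = sorted(crash_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def looks_weekly(crash_times, tolerance=timedelta(hours=2)):
    """True if every gap is within `tolerance` of exactly seven days."""
    gaps = crash_gaps(crash_times)
    week = timedelta(days=7)
    # With fewer than two crashes there is no gap to judge, so report False.
    return bool(gaps) and all(abs(gap - week) <= tolerance for gap in gaps)

# Hypothetical crash times (placeholders, not real observations).
example = [
    datetime(2021, 4, 5, 0, 10),
    datetime(2021, 4, 12, 0, 25),
    datetime(2021, 4, 19, 0, 5),
]
print(looks_weekly(example))
```

If this prints False for the real crash times, the crashes are probably not tied to a weekly job.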

Better yet, switch to a USB SSD, or if you want the best, get yourself an ArgonOne M2. So far it is the best case for the RPi 4.

@rousveiga, in the thread suggested by @kingrichard you will find a number of cases where Raspberry Pis crash on Sunday night, just after midnight. You can find more info in this GitHub issue:
HA stops work every Monday · Issue #47928 · home-assistant/core (github.com)

After replacing the SD card with a new one the crashes have stopped (for now?).

Thing is, we still don’t know what process triggers the crash. Something happens on Sunday night just after midnight that somehow pushes the SD card to its limits (and beyond).

Thank you! My issue seems to be slightly different to the one those people are experiencing; my Pi goes completely offline (can’t SSH) and the crashes happen seemingly at random. However, the debugging advice in that thread seems very useful, so I’ll try those methods to get more information that might help diagnose the issue.

Replacing the SD helped in some way; the crashes seem to happen less frequently, but they didn’t disappear completely.

OK, hopefully you can find the cause. Keep us posted if anything interesting turns up.

I increased the log level to debug, and it crashed again tonight.

I read the home-assistant.log from my computer, but it doesn’t shed any light on the issue. The very last entry seems to be a perfectly normal one about updating a template sensor:

2021-04-19 21:23:10 DEBUG (MainThread) [homeassistant.helpers.event] Template group [TrackTemplate(template=Template("{{ states('binary_sensor.sensor_movimiento_despacho_javier_motion') }}"), variables=None, rate_limit=None), TrackTemplate(template=Template("00:{{ states('input_number.timeout_ocupacion') | int }}:00"), variables=None, rate_limit=None)] listens for {'all': False, 'entities': {'input_number.timeout_ocupacion', 'binary_sensor.sensor_movimiento_despacho_javier_motion'}, 'domains': set(), 'time': False}

I will try to read the journal files once I get access to a Linux system with journalctl.
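For anyone following along, here is a minimal sketch of how the copied .journal files might be read on a Linux machine. It assumes `journalctl` is installed there and relies on its `--directory`, `--since` and `--no-pager` options. The helper names and the `/tmp/ha-journal` path are hypothetical.

```python
import subprocess

def build_journalctl_command(journal_dir, since=None):
    """Build a journalctl command that reads journal files from a directory
    instead of the running system's own journal."""
    cmd = ["journalctl", f"--directory={journal_dir}", "--no-pager"]
    if since is not None:
        # Only show entries on or after this date/time.
        cmd.append(f"--since={since}")
    return cmd

def read_journal(journal_dir, since=None):
    """Run journalctl and return its output as a list of text lines."""
    result = subprocess.run(
        build_journalctl_command(journal_dir, since),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

# Hypothetical folder holding the .journal files copied from the SD card.
print(" ".join(build_journalctl_command("/tmp/ha-journal", since="2021-04-19")))
```

The printed command can also be pasted straight into a terminal; `read_journal` is only worth using if you want to process the lines further in Python.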

I made some more progress: saved the .journal files from the last crash, enabled host access to my Pi and read them with journalctl.

Everything is normal until:

Apr 19 19:22:27 homeassistant systemd[1]: systemd-udevd.service: Watchdog timeout (limit 3min)!

The line before this one is a perfectly normal one, Bus:Handling <Event state_changed[L]. It has to do with a sensor that updates every second. Could that possibly be overloading the event bus?

What follows is about 500 lines of what seems to be a shutdown process, during which some states seem to fail as well. I posted those here. My Linux knowledge is not that deep, so I can’t really understand a lot of it, but at least now I have specific errors that I can search for.
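To search a text export of the journal for lines like the watchdog timeout above, something along these lines might help. It is only a sketch: the function name and the regular expression are my own, and the first sample line is a placeholder rather than real log output.

```python
import re

# Matches systemd watchdog messages such as the one quoted above.
WATCHDOG = re.compile(r"systemd\[\d+\]: (?P<unit>\S+): Watchdog timeout")

def find_watchdog_timeouts(lines, context=1):
    """Return (unit, preceding_lines, line) for every watchdog timeout found."""
    hits = []
    for i, line in enumerate(lines):
        match = WATCHDOG.search(line)
        if match:
            # Keep a few lines of context to see what happened just before.
            before = lines[max(0, i - context):i]
            hits.append((match.group("unit"), before, line))
    return hits

sample = [
    "(placeholder for the line logged just before the timeout)",
    "Apr 19 19:22:27 homeassistant systemd[1]: systemd-udevd.service: "
    "Watchdog timeout (limit 3min)!",
]
for unit, before, line in find_watchdog_timeouts(sample):
    print(unit, "->", line)
```

Raising `context` shows more of the lines leading up to each timeout, which is usually where the interesting part is.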

Edit: this seems to be related to “homeassistant crashing every day since an update in december” (Issue #1232, home-assistant/operating-system, GitHub) and “HASS unstable” (Issue #1119, home-assistant/operating-system, GitHub).

I downgraded the OS to 5.3 (I had experienced the issue on 5.12 and 5.13) and it’s been working flawlessly for a few weeks now.