Tips on how to debug a system that is hanging

My system started hanging a few days ago and I need some ideas to help me debug please.

Raspberry Pi 4, with an SSD using the Argon ONE M.2 case and a Kingston 256GB SSD, plus a USB Coral (for Frigate). Genuine RPi 3A power supply.
Initially I was two or three releases behind, but I am now running up-to-date versions of everything: core-2021.8.6, supervisor-2021.06.8.

The system hangs and the only thing I can do is get a ping back.
http://homeassistant.local:8123/ does not respond
SSH does not respond on 22 or 22222
http://homeassistant.local:4357/ does not respond
SAMBA does not respond
A connected display and keyboard do not respond
=> It really is a brick and I don't know how to capture the cause.
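One way to pin down when each service stops answering is to probe the ports listed above from a second machine and keep a timestamped record there. The following is a minimal sketch, not a tested monitoring tool: the port numbers come from the post, except 445 for Samba, which is an assumption (the standard SMB port), and the names `port_open` and `probe_line` are made up for this example.

```python
import socket
from datetime import datetime, timezone

# Ports from the post; 445 for Samba is an assumption (standard SMB port).
PORTS = {"web": 8123, "ssh": 22, "host-ssh": 22222, "observer": 4357, "samba": 445}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_line(host, ports=PORTS, check=port_open):
    """Build one timestamped line with the up/DOWN state of every port."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    states = [
        f"{name}({port})={'up' if check(host, port) else 'DOWN'}"
        for name, port in ports.items()
    ]
    return f"{stamp} {host} " + " ".join(states)
```

Calling `probe_line("homeassistant.local")` once a minute (from cron or a simple loop) and appending the result to a file on the second machine would show whether the services drop out together or one after another before the hang.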

But it comes right back up after a power cycle and everything runs OK for a random time, up to about a day. A power cycle also clears the logs, though. Is there a way to keep the logs between boots?
(My logs are generally very clean, no red flags before it hangs).
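If the normal logs are lost on a power cycle, one workaround is a small heartbeat that writes a line to persistent storage and forces it to disk, so that the last entry before the hang survives. This is only a sketch under assumptions: it presumes you can run a script on the Pi itself (which may not be straightforward on Home Assistant OS), that the usual Linux paths `/proc/loadavg` and `/sys/class/thermal/thermal_zone0/temp` are readable, and the function names are hypothetical.

```python
import os
import time


def read_first_line(path):
    """Return the first line of a file, or 'n/a' if it cannot be read."""
    try:
        with open(path) as fh:
            return fh.readline().strip()
    except OSError:
        return "n/a"


def snapshot():
    """Collect a one-line system snapshot (paths are assumed, see above)."""
    load = read_first_line("/proc/loadavg")
    # The kernel reports this value in thousandths of a degree Celsius.
    temp = read_first_line("/sys/class/thermal/thermal_zone0/temp")
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp} load={load} temp_raw={temp}"


def append_durably(path, line):
    """Append one line and fsync it so it survives a hard power cycle."""
    with open(path, "a") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())
```

Running `append_durably("/path/to/heartbeat.log", snapshot())` every minute or so gives a last-known-alive timestamp plus load and temperature just before the system stopped.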

When this started happening, it had been a week or more since I had touched the configuration or done any upgrades (I’d been away). The only change I have made since is to upgrade to the latest released versions of everything, but the behaviour is the same.

I swapped out the RPi4 => No change.

Next I am going to remove the Coral and turn off Frigate, but they have been running for six months so I do not suspect them.

I could swap out the power supply, but the one I have now is a genuine RPi one, so I do not suspect it, unless there is an issue with the extra SSD needing more than 3A peak?
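On the power question, the Raspberry Pi firmware keeps an under-voltage and throttling bitmask, which on Raspberry Pi OS is printed by `vcgencmd get_throttled` (for example `throttled=0x50005`). Whether that command is reachable on a given Home Assistant install is an assumption here. A small decoder, sketched with a hypothetical function name, makes the value readable:

```python
# Bit meanings of the firmware's throttled status value.
THROTTLE_FLAGS = {
    0: "under-voltage detected now",
    1: "ARM frequency capped now",
    2: "currently throttled",
    3: "soft temperature limit active now",
    16: "under-voltage has occurred since boot",
    17: "ARM frequency capping has occurred since boot",
    18: "throttling has occurred since boot",
    19: "soft temperature limit has occurred since boot",
}


def decode_throttled(text):
    """Decode text like 'throttled=0x50005' into a list of set flags."""
    value = int(text.strip().split("=")[-1], 16)
    return [
        desc
        for bit, desc in sorted(THROTTLE_FLAGS.items())
        if value & (1 << bit)
    ]
```

Because the "has occurred" bits are reset by a reboot, the value would need to be recorded periodically while the system is still up, not read after the power cycle.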

Then maybe I’ll try running off an SD card again rather than the SSD. The SSD has been running reliably for about 6 weeks. I guess I can make a backup, write a fresh SD card, remove the SSD, and restore the backup to get the same system on the SD card as I have today.

Any other ideas or experience appreciated.

I am having nearly the same issue. I can ping, and Samba is working (I can see the folders and their contents), but I can't SSH or connect via browser.

The automations are working.

If I power cycle (which I don't like to do) then it all comes good.

Not sure what to do?

Re Dave

I seem to have the issues mentioned. I was looking for an appropriate place to post, and although this thread is old, I want to respect the forum and not start a fresh one.

RPi4, 4GB RAM, external 128GB SSD, Aeotec Z-Wave Stick 5+. About 60-70 devices. Good RPi PSU.
Running at approx 8% disk and 30% memory. I’ve been through 3-4 microSD cards over time.
All packages are current, except for one release as noted below.

Runs fine for days and even weeks. Then it doesn’t. When it goes bad, it’s usually still responsive at the HA level - GUI works, controls/helper toggles work, but Z-Wave devices go unavailable and grey out. When I then go to look at logs to dig, it usually breaks even more at that point, and the whole thing collapses.
As soon as I power off and on, it fully works again.

Very subjective observation: It seems that it misbehaves when there are updates pending. I try to stay current, in fact I have to because of this - I’m almost always having to apply updates within the week after they appear because it will de-stabilise. HA Core 2023.11.3 is pending today, currently running 2023.11.2

Since I can’t get to the logs, I don’t know where to start. Today, I could not get to ssh/22222 though I had previously enabled it - after a few reboots to recover from repeated hangs, I set that up again via CONFIG & authorized_keys. Now that I can connect, I can’t see any issues and am not sure what to look for, or where to find logs from previous boots.

Would really appreciate some pointers. I really like HA, but it can go unstable at any moment, leaving me concerned about e.g. going on vacation, and I’ve had to install an internet power switch in front of it!

Thanks - Dave

I’m going to assume that you have already gone through the basics:

I too run an RPi4 and it has been stable for years. A few questions:

  1. How do you know that the PSU is good?
  2. Are the SD cards as recommended here?
  3. Could the SD cards have been corrupted (from power fail, etc.)?
  4. I have read that the PSU might not be able to supply enough power for other USB devices plugged directly into the RPi4 - have you tried using a separate USB hub with its own power supply (for the SSD and Aeotec)?
  5. Have you tried to access the home-assistant.log and home-assistant.log.1 log files directly (not through HA but via File Editor, or by moving the SD card to another system to read it)?

Or you could "Monitor your homeassistant-log …
Connect a monitor and keyboard to your Pi, type "login" and hit Enter
Then use "find" to locate home-assistant.log
Then "tail -F home-assistant.log"
Then watch
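If the log file can be copied off the device after a hang, a short script can show when logging stopped and what the last warning or error was. This is a rough sketch that assumes each entry starts with a timestamp and level such as `2023-11-20 10:15:02.123 ERROR (MainThread) [...]`; the regular expression and the `summarize` function are made up for this example and may need adjusting to your actual log lines.

```python
import re
from collections import Counter

# Assumed line prefix: "YYYY-MM-DD HH:MM:SS[.mmm] LEVEL ..."
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(?:\.\d+)? "
    r"(DEBUG|INFO|WARNING|ERROR|CRITICAL) "
)


def summarize(lines):
    """Return the last timestamp, per-level counts and the last problem line."""
    counts = Counter()
    last_stamp = None
    last_problem = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # continuation lines such as tracebacks
        last_stamp = match.group(1)
        level = match.group(2)
        counts[level] += 1
        if level in ("WARNING", "ERROR", "CRITICAL"):
            last_problem = line.rstrip("\n")
    return {
        "last_timestamp": last_stamp,
        "counts": dict(counts),
        "last_problem": last_problem,
    }
```

For example, `with open("home-assistant.log.1") as fh: print(summarize(fh))` would print the time of the final entry, which can then be compared with the moment the system stopped responding.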

Many thanks for the quick replies!

I have probably a dozen Raspberry Pis, ranging from the old B through the 4, plus some Zero, Zero WiFi and more recently Pico and Pico W. I had one issue with a Pi Zero WiFi which frequently dropped off wifi, and I troubleshot it extensively. Eventually I found a post describing a fix that, IIRC, ‘prodded’ the wifi driver daily, and it’s been rock solid for a couple of years now. Point being, I always use 2A & 3A PSUs from e.g. Adafruit and other good sources - I’ve switched them around and never found my PSUs to be an issue. While “never say never”, I do believe this one to be ‘good’ - no suspicions.

I had not considered a USB hub for the SSD and Z-Stick for the sake of adding a power connection; I’d be a little reluctant to introduce another step/variable in connectivity.

WRT the SD card, the card I’m using is a Gigastone 16GB Class 10 UHS-I U1. I got a batch of 10 and have used them in other RPis with no suspicion.

Appreciate the troubleshooting and log pointers, I’d like to go that path to see if there’s a common trigger; maybe relating to hitting something like FS corruption.

Just an update to close out my post.

After continued investigation I was never able to understand the hangs, which continued and maybe eventually got a little more frequent. Logs just stopped logging, things just stopped happening; it was impossible to diagnose without infinite poke’n’hope.

I decided to replace the Rasp Pi 4 with an alternate tried and trusted hardware solution, and chose the Odroid N2+ with 256GB eMMC.
With quite a bit of trepidation due to the possibility of having to completely re-build my Z-Wave network, I found a free weekend and bit the bullet. Having had the Odroid running with no Z-Wave for a couple of months “burn-in”, I took a full backup, reloaded it into the Odroid, re-plugged my Aeon ZStick Gen 5+ and it came up with zero problems. It’s been rock solid ever since!

It was a huge relief and finally vindication of my gut feeling that HA is a solid platform, but I have to say I’m baffled as to why the Rasp Pi 4 was such an issue. I’ve repurposed it for a couple of things and it’s also been solid since.