Home Assistant Crashing Frequently (RPi4)

Hi All-

Sorry that my first post is a problem enquiry.

My HA install has been crashing more and more frequently for about the last month. I have not been able to pinpoint anything so far. I’m looking for ideas to help isolate the cause…or alternatively, a way to recover the system without starting over from scratch. I recognize that I haven’t provided enough information below to SOLVE the problem. But given the nature of the problem, conducting experiments and collecting data has been difficult. So, again…I’m looking for suggestions of things to try, places to look for data I might not be aware of…etc.

A bit about my system:
I’ve been running HA on an RPi4 (8GB) for a little less than a year. I have the RPi in an Argon One case with heatsinks, and a fan running at full speed. The CPU temp never gets above 125F. The Argon One case has an M.2 SATA to USB3 slot, and I have a 256GB SSD installed as the boot/data drive (no SD card is in the slot).

Devices:

  • TP-Link Kasa switches,
  • Treatlife switches flashed with ESPHome,
  • Zigbee plugs,
  • and Aqara door/moisture/vibration sensors.
  • My Zigbee radio is a Nortek Zigbee/Z-Wave stick using ZHA (I don’t have any Z-Wave devices).
  • I also have a few google devices: chromecasts (TV and audio), nest speakers, and Home Minis, and a Vizio Smart TV.

Add-Ons

  • ESPHome, File Editor, Google Drive Backup, Mosquitto, Samba, Studio Code Server, Terminal & SSH, Z-Wave JS, Node-RED
  • ArgonOne Cooling
  • UniFi Network Controller
  • Assistant Relay
  • Tailscale
  • Network UPS Tools

Automations
Most of my automations are in Node-RED; I have a couple of minor ones still in the UI/native engine. I don’t have any blueprints.

Symptoms:
The system crashes randomly. Often within a few hours, sometimes minutes, sometimes a day or two (rare).

I have seen several instances of SQLite database corruption errors in the logs, when I can see them. Often, it takes two reboots (power-cycle, system restart) before the system will run long enough to inspect the logs…so, many times I can’t get to see the homeassistant.log.1 file before it gets overwritten by another reboot.
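One way around the overwritten-log problem might be to copy the log somewhere else on a schedule, so the evidence survives the next reboot. The following is only a rough sketch: the function name `snapshot_log`, the source path, and the destination folder are all placeholders I made up, and how you schedule it (cron, an automation, etc.) is left open.

```python
import shutil
import time
from pathlib import Path


def snapshot_log(src, dest_dir):
    """Copy a log file into dest_dir under a timestamped name.

    Returns the path of the copy, or None if the source does not exist.
    """
    src = Path(src)
    if not src.is_file():
        return None
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Timestamp prefix keeps every snapshot instead of overwriting the last one.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest_dir / f"{stamp}-{src.name}"
    shutil.copy2(src, target)  # copy2 also preserves the modification time
    return target


# Example call (both paths are hypothetical, adjust to your install):
# snapshot_log("/config/home-assistant.log", "/share/log-snapshots")
```

Ideally the destination is on a different disk or machine, since a copy on the same SSD does not help if the storage itself is the problem.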

Most of the time, when I can get in to look at the log there isn’t anything obviously interesting in the last few moments before the file ends. The SQLite corruption is most often in the next startup, possibly because of crash / power-cycle while the database was mid-transaction.
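To tell whether the database is actually damaged after a crash, SQLite's built-in `PRAGMA integrity_check` can be run against a copy of the file. Below is a minimal sketch using Python's standard `sqlite3` module. The function name is mine, and the example path assumes the default recorder file name `home-assistant_v2.db`; a different recorder setup (e.g. MariaDB) would need a different check.

```python
import sqlite3
from pathlib import Path


def integrity_check(db_path):
    """Return the messages from SQLite's PRAGMA integrity_check.

    A healthy database yields the single message "ok". Problems opening or
    reading the file are returned as messages instead of being raised.
    """
    # mode=ro opens the file read-only so the check cannot modify it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error as exc:
        return [f"could not open: {exc}"]
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    except sqlite3.DatabaseError as exc:
        return [f"error: {exc}"]
    finally:
        conn.close()


# Example call (path is an assumption, run it on a copy of the file):
# print(integrity_check("/tmp/home-assistant_v2.db"))
```

Running this on a copy taken right after a crash, before HA restarts and touches the file, would show whether the corruption comes from the crash itself or shows up later.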

I’ve been working to disable add-ons or integrations that I don’t use much or that can easily be enabled/re-installed without a lot of extra work, to see if the system stabilizes. E.g.: Node-RED is disabled…so, all my automations are no longer running. I’ve also been correcting any “ERRORs” that I did see in the logs to eliminate them as a source. So far, nothing has made any difference.

Because this came on subtly / slowly, I’m not sure exactly when it started…or what I may have added that initially caused the crashes.

Thanks for listening…

Random crashes are almost always hardware; the issue is proving that a crash is actually random and not just some factor you’re not monitoring! :face_with_symbols_over_mouth: :frowning:

In normal times, I’d suggest getting a spare RPi to create a pre-production test bed (careful with mDNS names) for soak testing, but getting a hold of one is tricky at the moment.

There are disk stress-test tools, but that would likely require cloning the M.2 disk and getting another USB dock - the issue then is which set of old/new parts to use in a test.

Once a month is frequent enough to be a PITA, but not enough to be easy to instrument. If you have the resources, create a new set of hardware (RPi, M.2, dock, etc) and move production to it - throwing cash at the issue won’t find it, but would tell you what it isn’t!

Personally, I’d set up automatic backups with an SCP copy somewhere else so a bad crash doesn’t lose too much. This could be extended to copy key log files many times a day for more evidence. Another useful tool would be a ‘watchdog’ on another machine regularly checking HASS is working - combined with an alert, you might get more context of the failure.
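The watchdog idea can be sketched in a few lines of Python, run from another machine. This is an illustration rather than a finished tool: the URL, the check interval, the failure threshold, and the names `is_up`, `FailureTracker`, and `watch` are all assumptions, and the "alert" here is just a timestamped print that you would swap for whatever notification you prefer.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=10.0):
    """Return True if anything answers HTTP at url, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so it is still alive.
        return True
    except OSError:
        # Covers connection refused, DNS failure, and timeouts.
        return False


class FailureTracker:
    """Counts consecutive failed checks and flags when to alert."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def record(self, ok):
        """Record one check; return True only when the threshold is first hit."""
        if ok:
            self.failures = 0
            return False
        self.failures += 1
        return self.failures == self.threshold


def watch(url, interval=60.0, threshold=3):
    """Poll url forever, printing a timestamped line when HA looks down."""
    tracker = FailureTracker(threshold)
    while True:
        if tracker.record(is_up(url)):
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "HA appears to be DOWN")
        time.sleep(interval)


# Example call (hostname and port are assumptions):
# watch("http://homeassistant.local:8123/")
```

Requiring a few consecutive failures before alerting avoids false alarms from a single dropped request, and the timestamp of the first alert gives a rough crash time to line up against the logs.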

If you’re adding load and disk activity by lots of backups, a storage issue on the M.2 might cause more crashes.

It’s never easy to find similar issues - and on-site troubleshooting was my specialism for a while. You have my sympathies!

To clarify…it happens every DAY. Sometimes more, rarely less.

I do have daily backup setup to Google drive. And have full snapshots going back to early August.

As it happens I DO have a spare Rpi4. I also have a spare PC sitting around. That’s a good suggestion. I could restore from a backup, and shut the production system down and see what happens. If the restore image works ok, then I can troubleshoot the hardware further.

In my experience, random crashes with pi’s are down to power supply. Been running pi’s for years 24/7 and random crashes were always power related. What power supply are you using?

IMHO, your ‘spare PC’ idea is an easier shotgun test as it rules out ALL hardware.

But my bet is that @Arh is probably right… It’s almost always the SD card in a Pi; since you don’t have an SD card installed, that leaves the power supply.

I would suspect the USB3 disk. Power supply? USB port suspend? Both are possible, especially with large SSDs (higher power demands).

I’m pretty sure it’s the power supply that came with the Argon One case. But it has non-specific Chinese branding on the back. And it’s been 6+ months since I set up the case. It is a 5V / 3.5A power supply with integrated switch, and “heavier” gauge cable (i.e., it’s not a generic USB phone charger).

Nevertheless, that’s another good point. I’ve had issues with CanaKit supplies on other RPis in the past, too.

I used the spare PC last night and restored from yesterday’s backup. I turned all the add-ons, integrations, and automations back on. It’s been rock-solid for the last 22 hours or so.

Install LibreELEC on the RPi and test the kit by watching lots of videos? :grin:

There is an official disc test tool, but not sure if the check is more for validation than hardware testing:

sudo apt update && sudo apt install agnostics

My guess would also be storage or possibly PSU. Get a meter on the power rails and the M.2 PCB.

Symptomatically, it sounds like I’m having the exact same issues. No Argon in the setup, and different branded hardware. RPi4, with SSD (but on USB2 to avoid zigbee interference). PiHat zigbee. A couple of USB serial devices. Only a couple of months old, previously used SD cards but have had a few crash and have subsequently upgraded to SSD only.

Everything was fine until I turned off the UAS quirk on the SSD (as this seems to be related to a different bug where no supervisor or add-on logs are visible when using an SSD on USB). These logs are now available, but don’t show anything much useful (except as add-ons crash); see [Supervisor Logs Extremely Slow to Open].

But… Since then I seem to have crashes daily or every second day. Sometimes they are quick and cause the system to restart; sometimes they seem slow, taking down one add-on at a time and corrupting the database. Wiping the database manually allows the system to start properly again for a day or two. I’ve done the whole limiting integrations/add-ons one by one, but it seems to be a different causative agent each time. I’ve also tried limiting the recorder history to just a handful of individual sensors I need for automations, but it hasn’t helped. I turned Zigbee device polling down to 60 minutes to limit MQTT database growth. Tried MariaDB and it still crashes. I have noticed though, even with the recorder limited to a few sensors only, database size gets to 5 or 10GB in a day or so. No single sensor seems to be more than 1 or 2% of the total database size though, so I’m not sure what the single reason is.

I’m going to remove the UAS quirk (forgoing the log availability) and see if that fixes the operational crashing, but I’m not hopeful. The only other option would be an SSD failure, already…

A few months ago mine started crashing several times every day at random. It’s an RPi4 with ConBee 2 and RFLink, and had been stable for years with no hardware changes.

Tried a ton of stuff, but it turned out to be a power supply fault. Worth having a look at binary_sensor.rpi_power_status. I bought a genuine RPi4 power supply and it fixed it.
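If you can get a shell where the Raspberry Pi firmware tool `vcgencmd` is available (it ships with Raspberry Pi OS; whether it is reachable from a given HA install is something I can't promise), `vcgencmd get_throttled` gives a second opinion on power problems. It prints a hex bitmask, and the sketch below decodes it using the bit meanings from the Raspberry Pi documentation. The function name and the sample value are just for illustration.

```python
# Bit positions as documented for `vcgencmd get_throttled`.
THROTTLED_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output):
    """Decode output such as 'throttled=0x50005' into a list of flag names."""
    # Take the part after '=' (or the whole string) and parse it as hex.
    value = int(output.strip().split("=")[-1], 16)
    return [
        text
        for bit, text in sorted(THROTTLED_FLAGS.items())
        if value & (1 << bit)
    ]


# Example with a made-up reading:
for flag in decode_throttled("throttled=0x50005"):
    print(flag)
```

Bits 0 to 3 describe the state right now, while bits 16 to 19 record that the condition has happened at some point, so an "under-voltage has occurred" flag on a box that crashes at random points at the supply or cable.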

I am having the exact same problem, using an RPi4 in an Argon One case. Could you tell me how you solved it?

Are you using the original power supply for the Argon One?

No. I bought it without a power supply. I am using the one that came with the RPi4. Furthermore, I’ve just added the binary_sensor.rpi_power_status sensor to Home Assistant to check. Should I buy an original power supply?

Not necessarily an original, but you need a beefy one. See if you can find either the original Argon One supply or one that will reliably deliver up to 3 Amps.

Pis are power-hungry creatures as it is. They easily outpace a generic or borderline 2.5A USB power supply… Add in all the electronics in the Argon One (I use one) and its fans… And an SSD… And multiple USB dongles… It’s not a problem when you’re driving a weather station and its SD card.

You’re not running a weather station :wink: Any voltage drop beyond a certain threshold and who knows what happens. Usually it ends with a hard crash…

Thanks a lot. I am going to do that!

Sorry…I didn’t see your question. I don’t log into here all that often. I replaced my power supply with a genuine RPi PS.

I wouldn’t really call Pis power-hungry…but they do require a lot of current at 5VDC (3 Amps is still only 15 watts). Drawing 3 Amps at 5VDC makes them susceptible to minor voltage drops causing low-voltage conditions / brown-outs. There just isn’t much headroom to allow for more drop. It’s a trade-off between the convenience of USB and using a proper power supply.
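To put rough numbers on the headroom point, here is a small Ohm's-law sketch: the voltage arriving at the Pi is the supply voltage minus current times cable resistance. The 0.15 ohm cable-plus-connector resistance is a made-up illustrative figure, and the 4.63V under-voltage threshold is the commonly quoted value for the Pi's warning, so treat both as assumptions rather than measurements.

```python
UNDER_VOLTAGE_THRESHOLD = 4.63  # volts; commonly quoted figure (assumption)
CABLE_OHMS = 0.15  # hypothetical round-trip cable + connector resistance


def voltage_at_pi(supply_volts, current_amps, cable_ohms):
    """Voltage left at the Pi after the drop across the cable (V = Vs - I*R)."""
    return supply_volts - current_amps * cable_ohms


# Compare a plain 5.0V supply with a 5.1V one at increasing load.
for supply in (5.0, 5.1):
    for amps in (1.0, 2.0, 3.0):
        volts = voltage_at_pi(supply, amps, CABLE_OHMS)
        state = "LOW" if volts < UNDER_VOLTAGE_THRESHOLD else "ok"
        watts = supply * amps
        print(f"{supply:.1f}V supply, {amps:.0f}A ({watts:.1f}W): "
              f"{volts:.2f}V at the Pi [{state}]")
```

With these assumed numbers, a 5.0V supply at 3A lands below the threshold while a 5.1V supply just clears it, which is the kind of thin margin being described: a slightly worse cable or a brief current spike from the SSD is enough to tip it over.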

My Problem was the following.
HA crashed / freezes from time to time. Unknown when.
I got rid of the problem by using the USB 2 for my external SSD. And now the problem is gone.

I have almost the same issue: HA crashes nightly with a SQLite → recorder → Supervisor issue. The network adapter becomes unresponsive, it’s impossible to restart HA using the UI, and SSH connections are refused, so the only option is a hard physical reset. The main problem is that I’m 3000km away from my home in Ukraine and I have to ask someone to come and reset the electricity switch. :sob:

To be honest, I don’t know what to do with this issue. The hardware is the same: Argon One, RPi 4 8GB, M.2 SSD.

It’s usually the power supply. Even if you started with the recommended one it’s still probably the power supply; they do go bad sometimes. I’ve been running HA on a Pi4 4GB with SSD for 3+ years and it’s been rock solid. So what’s going on in Ukraine? What’s 100 billion of our taxpayer money getting us over there?