Home Assistant Crashing Frequently (RPi4)

IMHO, your ‘spare PC’ idea is an easier shotgun test as it rules out ALL hardware.

But my bet is that @Arh is probably right… It’s almost always the SD card in a Pi; since you don’t have an SD card installed, that leaves the power supply.

I would suspect the USB3 disk. Power supply? USB port suspend? Both are possible, especially with large SSDs, which draw more power.

I’m pretty sure it’s the power supply that came with the ArgonOne case. But it has non-specific Chinese branding on the back, and it’s been 6+ months since I set up the case. It is a 5V / 3.5A power supply with an integrated switch and a “heavier” gauge cable (i.e., it’s not a generic USB phone charger).

Nevertheless, that’s another good point. I’ve had issues with CanaKit supplies on other RPis in the past, too.

I used the spare PC last night and restored from yesterday’s backup. I turned all the add-ons, integrations, and automations back on. It’s been rock-solid for the last 22 hours or so.


Install LibreElec on the RPi and test the kit by watching lots of videos? :grin:

There is an official disk test tool, but I’m not sure whether its check is more for validation than hardware testing:

sudo apt update && sudo apt install agnostics

My guess would also be storage or possibly PSU. Get a meter on the power rails and the M.2 PCB.

Symptomatically, it sounds like I’m having the exact same issues. No Argon in the setup, and different branded hardware: an RPi4 with an SSD (but on USB2 to avoid Zigbee interference), a PiHat Zigbee, and a couple of USB serial devices. It’s only a couple of months old; I previously used SD cards, but a few crashed, so I have since upgraded to SSD only.

Everything was fine until I turned off the UAS quirk on the SSD (as this seems to be a different bug, where no supervisor or add-on logs are visible when using an SSD on USB). These logs are now available, but they don’t show anything much useful (except when add-ons crash); see [Supervisor Logs Extremely Slow to Open]

But… since then I seem to have crashes daily or every second day. Sometimes they are quick and cause the system to restart; sometimes they seem slow, taking down one add-on at a time and corrupting the database. Wiping the database manually allows the system to start properly again for a day or two. I’ve done the whole routine of disabling integrations and add-ons one by one, but it seems to be a different causative agent each time. I’ve also tried limiting the recorder history to just a handful of individual sensors I need for automations, but it hasn’t helped. I turned Zigbee device polling down to every 60 minutes to limit MQTT database growth. I tried MariaDB and it still crashes. I have noticed, though, that even with the recorder limited to a few sensors only, the database size gets to 5 or 10GB in a day or so. No single sensor seems to be more than 1 or 2% of the total database size, though, so I’m not sure what the single reason is.

I’m going to remove the UAS quirk (foregoing the log availability) and see if that fixes the operational crashing, but I’m not hopeful. The only other option would be an SSD failure, already…

A few months ago mine started crashing several times every day at random. It’s an RPi4 with a ConBee 2 and RFLink, and it had been stable for years with no hardware changes.

Tried a ton of stuff, but it turned out to be a power supply fault. Worth having a look at binary_sensor.rpi_power_status. I bought a genuine RPi4 power supply and it fixed it.
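For anyone who wants to see what an under-voltage flag looks like outside Home Assistant: on Raspberry Pi OS, `vcgencmd get_throttled` prints a bitmask such as `throttled=0x50005`. The sketch below only decodes such a string; it does not call the tool itself, and whether `vcgencmd` is reachable on your particular install is an assumption you would need to check. The function name `decode_throttled` is made up for illustration.

```python
# Bit meanings as documented for `vcgencmd get_throttled` on Raspberry Pi OS.
THROTTLE_FLAGS = {
    0: "under-voltage detected (now)",
    1: "ARM frequency capped (now)",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred since boot",
    17: "ARM frequency capping has occurred since boot",
    18: "throttling has occurred since boot",
    19: "soft temperature limit has occurred since boot",
}


def decode_throttled(output: str) -> list[str]:
    """Decode a string such as 'throttled=0x50005' into readable flags.

    Hypothetical helper: it only parses text you paste in.
    """
    value = int(output.strip().split("=", 1)[-1], 16)
    return [text for bit, text in THROTTLE_FLAGS.items() if value & (1 << bit)]


if __name__ == "__main__":
    for flag in decode_throttled("throttled=0x50005"):
        print(flag)
```

A result of `throttled=0x0` decodes to an empty list, which would suggest the firmware has not seen an under-voltage event since the last boot.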


I am having the exact same problem, using an RPi4 in an Argon One case. Could you tell me how you solved it?

Are you using the original power supply for the Argon One?

No. I bought it without a power supply. I am using the one that came with the RPi4. Furthermore, I’ve just added the binary_sensor.rpi_power_status sensor to Home Assistant to check. Should I buy an original power supply?

Not necessarily an original, but you need a beefy one. See if you can find either the original Argon One supply or one that will reliably deliver up to 3 amps.

Pis are power-hungry creatures as it is. They easily outpace a generic or borderline 2.5 A USB power supply… Add in all the electronics in the Argon One (I use one) and its fans… and an SSD… and multiple USB dongles… It’s not a problem when you’re driving a weather station and its SD card.

You’re not running a weather station :wink: Any voltage drop beyond a certain threshold and who knows what happens, usually a hard crash…

Thanks a lot. I am going to do that!

Sorry…I didn’t see your question. I don’t log into here all that often. I replaced my power-supply with a genuine RPi PS.

I wouldn’t really call Pis power-hungry… but they do require a lot of current at 5VDC (3 amps is still only 15 watts). Running at 5VDC with 3 amps of current makes them susceptible to minor voltage drops causing low-voltage conditions / brown-outs. There just isn’t much headroom to allow for more drop. It’s a trade-off between the convenience of USB and using a proper power supply.
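To put rough numbers on that headroom, here is a small sketch of the arithmetic (Ohm’s law, V_drop = I × R). The 4.63 V warning threshold is the commonly quoted nominal figure and is treated as an assumption here, and the cable resistance and supply voltage in the example are made-up illustrative values, not measurements from anyone’s setup.

```python
# Assumed nominal under-voltage warning threshold for the Pi, in volts.
UNDER_VOLTAGE_THRESHOLD_V = 4.63


def power_w(volts: float, amps: float) -> float:
    """Electrical power in watts (P = V * I)."""
    return volts * amps


def voltage_at_board(supply_v: float, current_a: float, cable_ohms: float) -> float:
    """Voltage left at the board after the cable drop (V = Vs - I * R)."""
    return supply_v - current_a * cable_ohms


def is_under_voltage(board_v: float) -> bool:
    """True if the board voltage is below the assumed warning threshold."""
    return board_v < UNDER_VOLTAGE_THRESHOLD_V


if __name__ == "__main__":
    # Hypothetical case: 5.1 V supply, 0.2 ohm of cable and connector resistance.
    for amps in (1.0, 2.0, 3.0):
        v = voltage_at_board(5.1, amps, 0.2)
        print(f"{amps:.1f} A -> {v:.2f} V at the board, under-voltage: {is_under_voltage(v)}")
```

With those illustrative values the board stays above the threshold at light load and dips below it near 3 A, which is the kind of load-dependent brown-out described above.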

My problem was the following:
HA crashed or froze from time to time, with no apparent pattern.
I got rid of the problem by using a USB 2 port for my external SSD. And now the problem is gone.

I have almost the same issue: HA crashes nightly with a SQLite → recorder → Supervisor issue. The network adapter becomes unresponsive, it’s impossible to restart HA using the UI, and SSH connections are refused, so the only option is a hard physical reset. The main problem is that I’m 3000km away from my home in Ukraine and I have to ask someone to come and reset the electricity switch. :sob:

To be honest, I don’t know what to do with this issue. The hardware is the same: Argon One, RPi 4 8GB, M.2 SSD.

It’s usually the power supply. Even if you started with the recommended one, it’s still probably the power supply; they do go bad sometimes. I’ve been running HA on a Pi 4 4GB with an SSD for 3+ years and it’s been rock solid. So what’s going on in Ukraine? What’s 100 billion of our taxpayer money getting us over there?

I have been experiencing the same issues lately, but like you I’m unable to go home and check the physical hardware. The only thing I can do is ask my relatives to manually reset the RPi. After that everything comes back online like nothing happened.
It crashed Friday, Saturday, Monday and just less than two hours ago apparently, since the UI is unreachable and Wireguard stopped working. The timing is completely random and by looking just at the hardware info there is no sign of any bad process (CPU usage, temp, RAM and disk are all OK right before it stops recording the data).

This weekend I’ll be going home to see what is happening. Does anyone know a way I can debug these errors? (I have a mini HDMI cable I can use to connect a monitor to the headless RPi, but I have no idea what to do from that point on.)


This is my setup in case someone else is experiencing similar issues:
Hardware:
RPi 4 - 4GB
External SSD
Conbee II
USB connection to the UPS

EDIT: I don’t know if the hardware behaves well in cold temperatures. My RPi is running in a room with no heating, so the temperature can get down to around 8-11°C (CPU temp is around 30-35°C).

Main Integrations:
Samsung TV, HomeKit bridge, yeelight, ZHA, iRobot, Tasmota, Shelly, HACS

Add-ons:
AdGuard, Vaultwarden, Nginx, NUT, Unifi, Wireguard, Google Drive Backup, MariaDB

Here is how you get to the logs post-crash to see what’s up

Cold is way better than heat unless there’s condensation involved.

Pis usually fail because of storage issues (worn-out SD card, etc.) or voltage (bad or failing power supply). Considering you’re already on an SSD, I’d look at power first. Bring a spare known-good power adapter when you visit your box…

When I bought my SSD for the RPI4, 2 things came up in my research:

  1. SSDs can be quite power hungry at times, and the RPi4 USB power port is not exactly standard when it comes to delivering the needed power.
  2. Not all USB enclosures for SSDs are compatible with the RPi4; some give trouble, so be careful what you choose.

Both can often be addressed by placing a powered USB hub in between, so if you have one it is worth a try.
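As a rough way to reason about the first point, the sketch below adds up the current drawn by everything plugged into the Pi and compares it with a budget. The 1200 mA default reflects the commonly documented shared limit for the Pi 4’s USB ports and should be treated as an assumption; the per-device figures in the example are invented placeholders, so check the actual datasheets for your own devices.

```python
# Assumed shared downstream budget for the Pi 4's USB ports, in milliamps.
USB_BUDGET_MA = 1200


def usb_headroom_ma(devices: dict[str, int], budget_ma: int = USB_BUDGET_MA) -> int:
    """Return remaining milliamps; a negative value means the budget is exceeded."""
    return budget_ma - sum(devices.values())


if __name__ == "__main__":
    # Hypothetical peak draws, for illustration only.
    attached = {
        "ssd_enclosure_peak": 900,
        "zigbee_stick": 100,
        "usb_serial_adapter": 50,
        "second_usb_serial_adapter": 50,
    }
    headroom = usb_headroom_ma(attached)
    print(f"Headroom: {headroom} mA")
    if headroom < 200:
        print("Little or no margin: a powered USB hub may be worth trying.")
```

A powered hub takes the hub-attached devices out of this sum, since they then draw from the hub’s own supply rather than from the Pi.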


Sooo, I tried SSH’ing into the RPi this morning after someone restarted it for me, and I got a look at the system journal like you suggested.

After checking the latest crashes (apparently there were multiple), this is what I found:

Nov 29 12:42:27 homeassistant addon_a0d7b954_wireguard[584]:   allowed ips: 172.27.69.99/32
Nov 29 12:42:27 homeassistant addon_a0d7b954_wireguard[584]:   persistent keepalive: every 25 seconds
Nov 29 12:42:38 homeassistant addon_core_configurator[584]: INFO:2023-11-29 13:42:38,793:hass_configurator.configurator:127.0.0.1 - "GET / HTTP/1.1" 200 -
Nov 29 12:42:48 homeassistant systemd[1]: run-docker-runtime\x2drunc-moby-d907ccf591e0173a61dcbaa7d7d1c3ae71e3f46a43a6e597f7209537bc87b9e7-runc.8XrdA7.mount: Deactivated successfully.
Nov 29 12:42:48 homeassistant hassio_dns[584]: [INFO] 172.30.32.1:43168 - 39883 "A IN wlan0.local.hass.io. udp 37 false 512" NXDOMAIN qr,aa,rd 37 0.000464843s
-- Boot 679b3ee1f1794b4db750dcbf04daa5dd --
Apr 04 10:55:32 homeassistant systemd-timesyncd[549]: System clock time unset or jumped backwards, restoring from recorded timestamp: Wed 2023-11-29 12:42:59 UTC
Nov 29 17:50:51 homeassistant systemd-journald[134]: Oldest entry in /var/log/journal/6c2208e0ffb14c36991568c05da0a1d2/system.journal is older than the configured file retention duration (1month), suggesting rotation.
Nov 29 17:50:51 homeassistant systemd-journald[134]: /var/log/journal/6c2208e0ffb14c36991568c05da0a1d2/system.journal: Journal header limits reached or header out-of-date, rotating.
Nov 29 12:42:59 homeassistant systemd-resolved[445]: Clock change detected. Flushing caches.
Nov 29 17:50:51 homeassistant systemd-time-wait-sync[548]: adjtime state 5 status 40 time Wed 2023-11-29 12:42:59.732333 UTC
Nov 29 17:50:51 homeassistant systemd-time-wait-sync[548]: adjtime state 0 status 2000 time Wed 2023-11-29 17:50:51.536679 UTC
Nov 29 12:42:59 homeassistant systemd[1]: Started Network Time Synchronization.
Nov 29 12:42:59 homeassistant systemd[1]: Reached target System Time Set.
Nov 29 17:50:51 homeassistant systemd-resolved[445]: Clock change detected. Flushing caches.
Nov 29 17:50:51 homeassistant systemd-timesyncd[549]: Contacted time server 162.159.200.123:123 (time.cloudflare.com).
Nov 29 17:50:51 homeassistant systemd-timesyncd[549]: Initial clock synchronization to Wed 2023-11-29 17:50:51.536533 UTC.
Nov 29 17:50:51 homeassistant systemd[1]: Finished Wait Until Kernel Time Synchronized.
Nov 29 17:50:51 homeassistant systemd[1]: Reached target System Time Synchronized.
Nov 29 17:50:51 homeassistant systemd[1]: Started Discard unused blocks once a week.
Nov 29 17:50:51 homeassistant systemd[1]: Started Remove Bluetooth cache entries.
Nov 29 17:50:51 homeassistant systemd[1]: Reached target Timer Units.
Nov 29 17:50:51 homeassistant systemd[1]: Starting HassOS AppArmor...
Nov 29 18:10:51 homeassistant addon_a0d7b954_wireguard[560]:
Nov 29 18:10:51 homeassistant addon_a0d7b954_wireguard[560]: peer: REDACTED
Nov 29 18:10:51 homeassistant addon_a0d7b954_wireguard[560]:   allowed ips: 172.27.69.3/32
Nov 29 18:10:51 homeassistant addon_a0d7b954_wireguard[560]:   persistent keepalive: every 25 seconds
Nov 29 18:10:51 homeassistant addon_a0d7b954_wireguard[560]:
Nov 29 18:10:51 homeassistant addon_a0d7b954_wireguard[560]: peer: REDACTED
Nov 29 18:10:51 homeassistant addon_a0d7b954_wireguard[560]:   allowed ips: 172.27.69.99/32
Nov 29 18:10:51 homeassistant addon_a0d7b954_wireguard[560]:   persistent keepalive: every 25 seconds
Nov 29 18:10:52 homeassistant addon_core_configurator[560]: INFO:2023-11-29 19:10:52,761:hass_configurator.configurator:127.0.0.1 - "GET / HTTP/1.1" 200 -
-- Boot 9be89bbf284b4615962f2635136688c3 --
Apr 04 10:55:30 homeassistant systemd-timesyncd[546]: System clock time unset or jumped backwards, restoring from recorded timestamp: Wed 2023-11-29 18:10:21 UTC
Nov 29 18:10:21 homeassistant systemd-journald[134]: Oldest entry in /var/log/journal/6c2208e0ffb14c36991568c05da0a1d2/system.journal is older than the configured file retention duration (1month), suggesting rotation.
Nov 29 18:10:21 homeassistant systemd-journald[134]: /var/log/journal/6c2208e0ffb14c36991568c05da0a1d2/system.journal: Journal header limits reached or header out-of-date, rotating.
Nov 29 18:10:21 homeassistant systemd-time-wait-sync[545]: adjtime state 5 status 40 time Wed 2023-11-29 18:10:21.456318 UTC
Nov 29 18:10:21 homeassistant systemd-resolved[446]: Clock change detected. Flushing caches.
Nov 29 18:10:21 homeassistant systemd[1]: Started Network Time Synchronization.
Nov 29 18:10:21 homeassistant systemd[1]: Reached target System Time Set.
Nov 29 18:10:22 homeassistant kernel: bcmgenet fd580000.ethernet end0: Link is Down
Nov 29 18:10:23 homeassistant bluetoothd[525]: Path / reserved for Adv Monitor app :1.14
Nov 29 18:10:23 homeassistant bthelper[549]: hci0 new_settings: ssp br/edr le secure-conn
Nov 29 18:10:23 homeassistant bthelper[549]: [63B blob data]
Nov 29 18:10:23 homeassistant bthelper[549]: Changing power off succeeded
Nov 29 18:10:23 homeassistant bthelper[549]: [54B blob data]
Nov 29 18:10:23 homeassistant bthelper[549]: [50B blob data]
Nov 29 18:10:23 homeassistant bthelper[549]: [54B blob data]
Nov 29 18:10:23 homeassistant bluetoothd[525]: Adv Monitor app :1.14 disconnected from D-Bus
Nov 29 18:10:23 homeassistant bluetoothd[525]: Path / reserved for Adv Monitor app :1.15
Nov 29 18:10:23 homeassistant bthelper[524]: [63B blob data]
Nov 29 18:10:23 homeassistant bthelper[524]: AdvertisementMonitor path registered
Nov 29 18:10:23 homeassistant bluetoothd[525]: Adv Monitor app :1.15 disconnected from D-Bus
Nov 29 18:10:25 homeassistant NetworkManager[452]: <info>  [1701281425.2762] device (end0): carrier: link connected
Nov 29 18:10:25 homeassistant kernel: bcmgenet fd580000.ethernet end0: Link is Up - 100Mbps/Half - flow control off
Nov 29 18:10:26 homeassistant kernel: bcmgenet fd580000.ethernet end0: Link is Down
Nov 29 18:10:31 homeassistant systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Nov 29 18:10:32 homeassistant NetworkManager[452]: <info>  [1701281432.3017] device (end0): state change: activated -> unavailable (reason 'carrier-changed', sys-iface-state: 'managed')
Nov 29 18:10:32 homeassistant systemd-resolved[446]: end0: Bus client set default route setting: no
Nov 29 18:10:32 homeassistant systemd-resolved[446]: end0: Bus client set MulticastDNS setting: no
Nov 29 18:10:32 homeassistant systemd-resolved[446]: end0: Bus client reset DNS server list.
Nov 29 18:10:32 homeassistant NetworkManager[452]: <info>  [1701281432.3440] manager: NetworkManager state is now CONNECTED_LOCAL
Nov 29 18:10:32 homeassistant NetworkManager[452]: <info>  [1701281432.3478] manager: NetworkManager state is now DISCONNECTED
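If the journal is long, a small script can pull out the last line logged before each `-- Boot … --` marker, which is where a crash or power cut would show up. This is only a sketch that works on saved journal text like the excerpt above. The function name `find_boots` is made up, and the "clean shutdown" check is a rough heuristic based on the assumption that an orderly shutdown leaves messages such as "Journal stopped" or "Shutting down" just before the marker.

```python
import re

# Matches journal boot separators such as "-- Boot 679b3ee1... --".
BOOT_MARKER = re.compile(r"^-- Boot ([0-9a-f]+) --$")

# Heuristic assumption: these phrases tend to appear during an orderly shutdown.
CLEAN_SHUTDOWN_HINTS = ("Journal stopped", "Shutting down")


def find_boots(lines):
    """Return (boot_id, last_line_before_boot, looked_clean) per boot marker."""
    results = []
    previous = []  # non-empty lines seen since the previous marker
    for raw in lines:
        line = raw.strip()
        match = BOOT_MARKER.match(line)
        if match:
            tail = previous[-5:]
            looked_clean = any(hint in entry for entry in tail for hint in CLEAN_SHUTDOWN_HINTS)
            last_line = previous[-1] if previous else None
            results.append((match.group(1), last_line, looked_clean))
            previous = []
        elif line:
            previous.append(line)
    return results


if __name__ == "__main__":
    sample = [
        "Nov 29 12:42:48 homeassistant hassio_dns[584]: [INFO] example entry",
        "-- Boot 679b3ee1f1794b4db750dcbf04daa5dd --",
        "Apr 04 10:55:32 homeassistant systemd-timesyncd[549]: System clock time unset",
    ]
    for boot_id, last_line, looked_clean in find_boots(sample):
        print(boot_id, "| clean shutdown hints:", looked_clean)
        print("  last entry before boot:", last_line)
```

A boot marker that directly follows ordinary add-on chatter, with no shutdown messages in between, is consistent with the machine losing power or hanging hard rather than rebooting on purpose.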

I’m no expert, but these freezes do look like a dying power supply to me…
