Hi All-
Sorry that my first post is a problem enquiry.
My HA install has been crashing for about the last month or so, more and more frequently, and so far I have not been able to pinpoint the cause. I’m looking for ideas to help isolate it…or alternatively, a way to recover the system without starting over from scratch. I recognize that I haven’t provided enough information below to SOLVE the problem. But given the nature of the problem, conducting experiments and collecting data has been difficult. So, again…I’m looking for suggestions of things to try, places to look for data I might not be aware of, etc.
A bit about my system:
I’ve been running HA on an RPi4 (8GB) for a little less than a year. I have the RPi in an Argon One case with heatsinks and a fan running at full speed. The CPU temp never gets above 125F. The Argon One case has an M.2 SATA-to-USB3 slot, and I have a 256GB SSD installed as the boot/data drive (no SD card is in the slot).
Devices:
- TP-Link Kasa switches,
- Treatlife switches flashed with ESPHome,
- Zigbee plugs,
- and Aqara door/moisture/vibration sensors.
- My Zigbee radio is a Nortek Zigbee/Z-Wave stick using ZHA (I don’t have any Z-Wave devices).
- I also have a few Google devices: Chromecasts (TV and audio), Nest speakers, and Home Minis, and a Vizio Smart TV.
Add-Ons:
- ESPHome, File Editor, Google Drive Backup, Mosquitto, Samba, Studio Code Server, Terminal & SSH, Z-Wave JS, Node-RED
- Argon One Cooling
- UniFi Network Controller
- Assistant Relay
- Tailscale
- Network UPS Tools
Automations:
Most of my automations are in Node-RED; I have a couple of minor ones still in the UI/native engine. I don’t have any blueprints.
Symptoms:
The system crashes randomly. Often within a few hours, sometimes minutes, sometimes a day or two (rare).
I have seen several instances of SQLite database corruption errors in the logs, when I can see them. Often, it takes two reboots (power-cycle, system restart) before the system will run long enough to inspect the logs…so, many times I can’t get to see the homeassistant.log.1 file before it gets overwritten by another reboot.
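To stop losing that file, one thing I could try is copying homeassistant.log.1 to a timestamped name as soon as the system comes back up. This is only a rough sketch: the function name preserve_previous_log and the archive folder are made up for illustration, and I'm assuming the logs live in /config.

```python
import shutil
import time
from pathlib import Path


def preserve_previous_log(config_dir, archive_dir):
    """Copy homeassistant.log.1 (if present) to a timestamped file.

    Returns the path of the copy, or None if there was nothing to copy.
    Both directory arguments are assumptions about where things live.
    """
    src = Path(config_dir) / "homeassistant.log.1"
    if not src.is_file():
        return None  # no previous-run log to preserve

    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Timestamp the copy so a later reboot cannot overwrite it.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"homeassistant-{stamp}.log"
    shutil.copy2(src, dest)
    return dest
```

Something like preserve_previous_log("/config", "/config/log_archive") would have to run right after each start, before the next reboot rotates the logs again.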
Most of the time, when I can get in to look at the log, there isn’t anything obviously interesting in the last few moments before the file ends. The SQLite corruption most often shows up on the next startup, possibly because of a crash / power-cycle while the database was mid-transaction.
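To tell whether the database is actually damaged on disk, rather than just reported as such at startup, I could run SQLite's built-in integrity check against it. A minimal sketch, assuming the recorder database is at the default /config/home-assistant_v2.db (mine may differ); integrity_report is just a name I made up:

```python
import sqlite3


def integrity_report(db_path):
    """Return the messages from SQLite's PRAGMA integrity_check.

    A healthy database yields ["ok"]; anything else points to a problem.
    """
    try:
        # Open read-only so the check itself cannot modify the file.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    except sqlite3.Error as exc:
        return [f"could not open: {exc}"]
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
        return [row[0] for row in rows]
    except sqlite3.DatabaseError as exc:
        # Badly damaged files can fail before the check even runs.
        return [f"check failed: {exc}"]
    finally:
        conn.close()
```

It is probably safest to run this on a copy of the file taken while HA is stopped, e.g. integrity_report("/config/home-assistant_v2.db.copy"), so the check doesn't compete with the recorder.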
I’ve been working to disable add-ons or integrations that I don’t use much, or that can easily be re-enabled/re-installed without a lot of extra work, to see if the system stabilizes. E.g., Node-RED is disabled…so all my automations are no longer running. I’ve also been correcting any “ERRORs” that I did see in the logs to eliminate them as a source. So far, nothing has made any difference.
Because this came on subtly / slowly, I’m not sure exactly when it started…or what I may have added that initially caused the crashes.
Thanks for listening…