Raspberry Pi HA crashed - help identifying root cause and misc mysteries - next steps?

Yesterday at 01:00.07 my HA crashed (? at least it was no longer on the network and was unresponsive in physical tests) (Raspberry Pi 4 8GB). This has never happened like that before, and I am looking for a way to identify what caused it.

  1. everything was up to date until last week.
  2. looking at it, it seemed to be running but:
  3. it was not connecting to the network: no IP, no ping possible
  4. it was not actually running any automations, even ones that do not need the network (e.g. flow meter to turn on relay)
  5. the SSD drive was running (green/red light as usual) and had loads of free storage
  6. eventually, while debugging, I added an additional power source for the SSD drive

After cycling restarts (10x during the day, giving it about an hour between each to be sure (?)) it actually started!? (It took 3 more restarts after I added the extra power source on the SSD, so I don’t think that is actually the solution since it did not work directly, but I left it on “just in case”.)

on its startup:

  1. the immediate error in the interface was that the supervisor for Bluetooth did not start, so I disabled it (but I think this was an effect of it just having trouble starting… but maybe this was it?)
  2. no logs or errors showed up on startup that I did not have previously
  3. it was very slow and strange. Thinking I should reflash the card, and wanting to make sure I had the data, I ran a backup… the backup is now 2GB, where it had swelled to 4.8GB in recent weeks, quite suddenly filling my local card, so a few weeks ago I moved the data to the SSD (figuring out what was actually causing this was on my todo list, but I have not identified it, and now it seems to be gone?). But now, what did I lose, and which backup should I use to restore from?
  4. I then did a full reboot
  5. it now seems to be running fine, e.g. water flow is triggering the relay in the system and I see it running physically
  6. it did not crash again at 01:00.xx
  7. processor logs show that usage was quite high on Monday at 01:00.xx, and there are even some recordings of it after this, but everything else was not recorded (e.g. water flow)… so it seems something else broke… disconnected from the network? (Unfortunately I only turned on the processor logs a few days ago, along with storage space, so I cannot see if this is normal for a “Monday @01:00.xx”.)
  8. logs for the water flow counter show it was working, and the relay indicates it likely turned on… but I know it did not actually turn on in a physical test, which I tried multiple times.

Any help is appreciated, e.g.:
How can I identify what could be the cause? Or should I just do a restore?
Any ideas how to figure out what data has been pruned between backups (the now missing 2.8GB!?), or which restore would you use: the 4.8GB from a few days ago or the current 2GB?

Have a look here.

But more than likely PSU or SD card.

thank you!
I had been working through that!

There are a couple of weird things that maybe I was not explicit enough about above:

  1. the loss of data… I guess I need to go and compare the data to figure out what got pruned.
  2. honestly I was surprised when it actually connected… after restarting 10x!? And it seems fine now… so it seems maybe strange that it would be the SD card?
  3. when it was unreachable, the processor usage was still being recorded but automations did not work? Maybe it was in a kind of safe mode?

been analyzing a bit

1. backups mystery of the missing GB

I compared the backups to figure out why they are so different in size, and found something strange, I think… the db file is in a different place between backups (I extracted homeassistant.tar from the backup into the directory above):

Does anyone know if it is “normal” that it moves position? Could it be that one backup is from when I had the db on the SD card and the other from when it was on the SSD, and that is why it moves in the backup?
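For anyone wanting to see where the size difference between two backups actually sits, here is a rough Python sketch that lists the files inside two tar archives and reports the ones whose size changed. It assumes the archives are not encrypted and can be read by Python's `tarfile` module; the function names and the file names passed on the command line are just placeholders.

```python
import sys
import tarfile


def tar_sizes(path):
    """Map each regular file inside a tar archive to its size in bytes."""
    # "r:*" lets tarfile auto-detect plain, gzip, bz2 or xz archives.
    with tarfile.open(path, "r:*") as tar:
        return {m.name: m.size for m in tar.getmembers() if m.isfile()}


def compare_sizes(old, new):
    """Return (name, old_size, new_size) for changed files, biggest change first.

    A size of None means the file is missing from that archive.
    """
    rows = []
    for name in set(old) | set(new):
        a, b = old.get(name), new.get(name)
        if a != b:
            rows.append((name, a, b))
    rows.sort(key=lambda r: abs((r[2] or 0) - (r[1] or 0)), reverse=True)
    return rows


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (placeholder names): python compare_tars.py old.tar new.tar
    for name, a, b in compare_sizes(tar_sizes(sys.argv[1]), tar_sizes(sys.argv[2])):
        print(f"{name}: {a} -> {b}")
```

A file that shows up under a different path in the two archives will appear twice, once as missing from the old archive and once as missing from the new one, which would also make a moved db file easy to spot.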

Then there is maybe an issue that it is lighter now… is there any way to see what the difference actually is between the db files?
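One possible way to compare two db files, assuming they are ordinary SQLite files (which is the Home Assistant recorder default), is to count the rows in every table of each file and print the tables where the counts differ. This is only a sketch: the function names are made up for illustration, and it is safest to run it on copies extracted from the backups rather than on the live database.

```python
import sqlite3


def table_counts(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    con = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        return {
            name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        con.close()


def count_diff(old, new):
    """Return {table: (old_count, new_count)} for tables whose counts differ."""
    return {
        table: (old.get(table, 0), new.get(table, 0))
        for table in sorted(set(old) | set(new))
        if old.get(table, 0) != new.get(table, 0)
    }
```

Calling `count_diff(table_counts("old.db"), table_counts("new.db"))` (placeholder file names) should show which tables shrank. Row counts do not account for everything, since a database can also shrink when free pages are reclaimed, but they give a first idea of what was pruned.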

I have been monitoring the storage and it is basically flat now… so it is not even growing like it was (?)

I am not seeing any missing data at this time… but I will keep looking, maybe it is in statistics and things I am not normally looking at (?)

2. system monitor

I took a look at the system monitor data. Again, I only recently turned this on… but I can see the processor was working hard earlier that day and then dropped significantly when it went off network/offline (green line):

I guess it would act like this… dropping when it went offline.
It does look like memory did not really clear until I restarted it (turned off the power, as it was unresponsive) around 0600… the 0500 drop I cannot account for…

disk usage stays the same, so it does not look like I lost data (?)

Is there any way to figure out what was making the processor go wild in the run-up to this?
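One way to look for a culprit is to ask the recorder database which entities logged the most state changes in the window around the CPU peak. The sketch below is an assumption-heavy example: it assumes a SQLite recorder database with a `states` table (columns `metadata_id`, `last_updated_ts`) joined to a `states_meta` table (columns `metadata_id`, `entity_id`), as I understand recent Home Assistant versions lay it out. Older schemas differ, so check the table layout first. The function name is hypothetical.

```python
import sqlite3

# Assumed schema: states(metadata_id, last_updated_ts) and
# states_meta(metadata_id, entity_id). Adjust if your database differs.
QUERY = """
SELECT m.entity_id, COUNT(*) AS n
FROM states AS s
JOIN states_meta AS m ON m.metadata_id = s.metadata_id
WHERE s.last_updated_ts >= ? AND s.last_updated_ts < ?
GROUP BY m.entity_id
ORDER BY n DESC, m.entity_id
LIMIT ?
"""


def busiest_entities(con, start_ts, end_ts, limit=10):
    """Return [(entity_id, state_change_count)] for a Unix-time window."""
    return con.execute(QUERY, (start_ts, end_ts, limit)).fetchall()
```

With `con = sqlite3.connect("copy_of_recorder.db")` (placeholder name) and Unix timestamps for the start and end of the peak, an entity with a count far above the rest is a good first suspect.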



again, any ideas or further advice to avoid this in the future are appreciated!

After quite a bit of looking around … I discovered what caused the crash!

It was a flow sensor on an ESP32 via ESPHome that went crazy and was just registering millions of ticks while the flow was actually off… these ticks coincide perfectly with the processor peaking.

It turned out to be caused by a bad Dupont connection that vibrated itself into an indeterminate on/off state. Processing this intermittent connection made the processor go bananas, as shown above, just counting the rotations (no calculation).
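Fixing the connection is the real cure, but as an illustration of how spurious ticks like these can be rejected in software, here is a minimal Python sketch of a glitch filter: it drops any pulse that arrives sooner after the previous accepted pulse than the sensor could physically produce. The function name and the `min_interval` value are assumptions for the example, not what ESPHome does internally.

```python
def filter_pulses(timestamps, min_interval):
    """Keep only pulses at least min_interval seconds after the last kept pulse.

    timestamps: pulse times in seconds, in any order.
    min_interval: shortest plausible gap between two real pulses (assumed value).
    """
    kept = []
    for t in sorted(timestamps):
        if not kept or t - kept[-1] >= min_interval:
            kept.append(t)
    return kept
```

For example, with `min_interval=0.05` a burst of ticks a millisecond apart collapses into a single pulse, while real pulses spaced half a second apart all pass through.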