Thinking about the issue I raised in this topic, I’m considering adding an external hardware watchdog to my remote homeassistant node.
I’m noodling it at the moment, but I foresee something like:
A battery (or supercapacitor) powered ESP32 or similar base hardware, charged from 5/12V, controlling a normally-on mains relay
Software that expects regular kicks from a number of sources
and opens the relay for (say) 30s if x consecutive kicks are missed.
The idea being that the controlled devices (Pi, Router) all need to kick the watchdog regularly. If one or more fail to do so for a defined number of cycles, the watchdog turns off the power to all devices (including itself) for enough time that the devices lose their memory contents before switching back on - the equivalent of me going to the remote site and cycling the mains power.
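To make the kick-counting idea concrete, here is a minimal sketch of the decision logic only, written in Python so it can be tried on a desk before touching any hardware. Everything in it is an assumption for illustration: the class name `KickWatchdog`, the source names, the 60s kick interval and the threshold of 3 missed cycles are all hypothetical, and the relay control itself is left out.

```python
class KickWatchdog:
    """Tracks the last kick from each source and decides when to cut power."""

    def __init__(self, sources, now, interval_s=60.0, max_missed=3, off_time_s=30.0):
        self.interval_s = interval_s    # how often each device should kick (assumed)
        self.max_missed = max_missed    # "x" missed kicks before acting (assumed)
        self.off_time_s = off_time_s    # how long the relay would stay open
        # Treat start-up as a kick from everyone so nothing trips immediately.
        self.last_kick = {name: now for name in sources}

    def kick(self, source, now):
        # Ignore kicks from sources we were not told about.
        if source in self.last_kick:
            self.last_kick[source] = now

    def missed_cycles(self, source, now):
        # Whole kick intervals that have elapsed since this source last kicked.
        return int((now - self.last_kick[source]) // self.interval_s)

    def should_power_cycle(self, now):
        # One silent device is enough to justify cycling the mains.
        return any(self.missed_cycles(s, now) >= self.max_missed for s in self.last_kick)


wd = KickWatchdog(["pi", "router"], now=0.0)
wd.kick("pi", now=170.0)
wd.kick("router", now=60.0)
print(wd.should_power_cycle(now=170.0))  # False: router has missed only 1 cycle
print(wd.should_power_cycle(now=240.0))  # True: router has now missed 3 cycles
```

On real hardware the main loop would call `should_power_cycle()` periodically and, when it returns true, open the relay for `off_time_s` and then reset every `last_kick` entry.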
I realise that this solution could be catastrophic to the Pi, but it’s frozen anyway.
Before I go any further, has anybody been down this road before me?
I’d love to, @nickrout, I’d love to. Unfortunately, HAOS doesn’t provide the tools to let me do so.
To clarify, the Pi isn’t freezing. I can still access homeassistant, but it cannot talk to the addons or supervisor.
Homeassistant only lets you see logs from the current boot. I can see from the Terminal addon that there are old (binary) journal files in /var/log/journal, but HAOS and the addon supply no tools to explore them. Neither do such tools seem to exist in the apk repositories. Maybe they don’t exist for buildroot, from which HAOS is built.
Not being able to see what was happening when something went wrong denies me the opportunity to debug the problem.
I just had a similar issue. Nothing worked except lovelace, immediate states and even control. But no history, no settings, no addons, no system settings; I could only interact with switches. I couldn’t reset because the reset button in the companion app (on my phone) didn’t work. The webpage on my PC did not load at all. A complete(ish) hang.
I am thinking of an ESPHome-based relay on the power source for the RPi, which stays on unconditionally for the first 3hrs from start and then requires a heartbeat, probably from an automation, as my ESPHome devices were still working with HA (so there is no way of detecting this issue through the API connection).
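The grace-period-then-heartbeat rule can be written as a single check. This is only a sketch of the logic in Python, not ESPHome configuration; the function name `heartbeat_overdue` and the 10 minute timeout are assumptions, and only the 3 hour grace period comes from the post above.

```python
def heartbeat_overdue(uptime_s, since_last_heartbeat_s,
                      grace_s=3 * 3600, timeout_s=600):
    """Return True when the relay should cut power to the RPi."""
    # During the first 3 hours after start the relay stays on regardless.
    if uptime_s < grace_s:
        return False
    # After that, power is cut once the heartbeat has been silent too long.
    return since_last_heartbeat_s > timeout_s


print(heartbeat_overdue(uptime_s=3600, since_last_heartbeat_s=3600))    # False
print(heartbeat_overdue(uptime_s=4 * 3600, since_last_heartbeat_s=900)) # True
```

The grace period gives time to get in and disable the automation or fix things after a reboot before the relay starts enforcing the heartbeat.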
But only if you’ve previously enabled access using the tip in the previous comment.
Since this problem occurred (which did require a site visit to recover), I’m wondering if it would be possible for the OS to make use of the hardware watchdog built in to the Pi. I’ve seen references to adding the string
dtparam=watchdog=on
to config.txt. I’ve no idea if HA OS can make use of this facility though.
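For context, on a general Linux system an enabled hardware watchdog usually appears as a device file (commonly /dev/watchdog) that a process must write to regularly; if the writes stop, the hardware resets the board. The Python sketch below shows that pattern under stated assumptions: the function name `pet_watchdog`, the interval and the cycle count are hypothetical, it needs root on a real device, and it says nothing about whether HA OS exposes or uses the device.

```python
import time


def pet_watchdog(device="/dev/watchdog", interval_s=5.0, cycles=3, sleep=time.sleep):
    """Open the watchdog device and kick it a fixed number of times."""
    # Opening the device arms the watchdog; unbuffered so each write lands at once.
    with open(device, "wb", buffering=0) as wd:
        for _ in range(cycles):
            wd.write(b"\0")    # any write counts as a kick
            sleep(interval_s)  # must be shorter than the driver's timeout
        # "Magic close": writing 'V' before closing asks the driver to disarm,
        # if the driver supports it and was not configured to forbid it.
        wd.write(b"V")
```

A real daemon would loop indefinitely and only kick while its own health checks pass, so that a hung system stops kicking and gets reset.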
In the meantime, the next time I visit, I’ll enable this ssh backdoor into the pi.
Although this topic is a few months old I would like to contribute my experiences with the exact same issue.
I experimented a lot with different power supplies, cables and even a UPS HAT, but the issues persisted.
In the end the cause was the M.2 SSD (Samsung 850, 250GB) that HAOS was running on. I used a simple USB-C enclosure from Amazon, connected to a USB3 port of my RPi4. It seems that every time the drive went into its internal NAND refresh cycle, the current draw caused a voltage drop which eventually crashed the external M.2 enclosure.
I’ve added an active (powered) USB3 hub between the Pi and the M.2 drive so that it no longer depends on the Pi’s power. This issue might also apply to M.2 SSD HATs which are connected to the Pi’s USB port.