HA install failed me at the worst possible time - I found an Aqara water sensor floating in my flooded basement

I’ve been running Home Assistant on a Raspberry Pi 4 since Aug 2022, and I’ve been building and expanding it since. I have approximately 100 devices connected via Zigbee, Wi-Fi, Bluetooth, and LoRa. I have 7 Aqara water detection sensors in the house with two automations. One notifies me if any sensor goes offline. The other notifies me if any sensor detects moisture. I’ve had great reliability with HA OS running on the Pi 4 until yesterday, when I had a worst-case-scenario failure.

I came home from work to a couple of inches of water in the basement, and my Aqara water detection sensor was floating. I went into HA to check the sensor status, expecting it to be offline. I wasn’t able to access HA at all, so after getting the water issue managed I physically cycled power on the Pi at 12:45 pm and started investigating. My sensor data showed that everything was working fine until 6:04:07 am, and then all HA entities froze for roughly the next 6 hours and 40 minutes. What is interesting is that Home Assistant kept working for only 24 seconds after my wife woke up. She woke up at 6:03:43, as I can see from the motion sensor under the bed. I can follow her exact movements through the house via the door and motion sensors for the next 24 seconds, into our kid’s room. The next change I expected, based on her movement after getting our kid, was motion in the kitchen, but that never registered. All data froze at 6:04:07 am and didn’t start updating again until I power-cycled the Pi at 12:45 pm. I’m still shocked that this happened just an hour or so before the basement started flooding.

I went to look at the logs, but both the HA Core and Supervisor logs were already overwritten. I didn’t realize you could change the default number of log entries kept before they start overwriting, which I will now do. I also never realized how many warnings I get: they fill the log with 50 entries in just a few hours, which I also need to fix.

I have 3 questions:

  1. Is there any way I can determine what happened now that the Home Assistant Core and Supervisor logs have been overwritten?
  2. If not, can I at least determine whether it was HA OS or HA Core that had the issue, based on how it manifested? All entities just froze in their current state.
  3. Most importantly, what is the best hardware or software watchdog solution to ensure I am immediately notified if HA Core, Home Assistant OS, or the hardware itself becomes unresponsive?

Have you checked the home-assistant.log.1 file? That should be the log from the previous boot (i.e., before you restarted).


There are surely many, but a simple and free option I found is Uptime Kuma. It can even be installed as an HA add-on, but I would opt to run it on another machine so that it can better report issues with HA. Mine runs in a VM on a different Proxmox node.

HA has saved me from water damage a few times already. I also added a couple of Zooz Titans to automatically shut my water off. One nice feature is that the Titan has its own leak sensor that will shut off the water regardless of whether HA is running. Of course, this only works for leaks where the shut-off valve is… in my case, one is where the water softener and self-cleaning filter are, and the other is where I have the tankless water heater.

For best availability, you could dedicate a small PC as a VM host and put HA in it. The supervisor can give you warnings, etc., based on the state of the HA VM.

@sparkydave I had not checked those, but I did based on your suggestion. Unfortunately, home-assistant.log.1 covers the period from about 10 minutes after the reboot to 17:40, and home-assistant.log covers 17:42:36 to the present. There is no gap in sensor data, which makes me think Home Assistant didn’t reboot but the log file started over for some reason. I did at least learn that these files exist, so thanks!

Thanks @odwide and @aruffell. I was hoping for an easy local software-based solution, but I understand that if the watchdog runs on the same hardware as HA, there is a failure mode I can’t catch. I guess I’ll bite the bullet and figure out how to run a dedicated watchdog. If it only runs a watchdog, could I use a Pi Zero as a short-term solution? Eventually I will set up a home lab and run Docker, but I don’t want to spend the time doing that right now. If a Pi Zero is not recommended, is there another small low-power PC I could use?
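
For anyone wanting to try the dedicated-watchdog idea on a second small machine, here is a minimal sketch in Python using only the standard library. It polls Home Assistant's REST API status endpoint (`GET /api/` with a long-lived access token) and raises an alert after several consecutive failures. The address, the token placeholder, the thresholds, and the `notify` callback are all assumptions for illustration; how you actually get notified (push service, SMS, email) is up to you.

```python
import http.client
import json
import urllib.request

HA_URL = "http://homeassistant.local:8123"  # assumption: adjust to your install
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"      # placeholder, create one in your HA profile


def fetch_api_status(base_url, token, timeout=10):
    """GET /api/ and return the decoded JSON body; raises on network errors."""
    req = urllib.request.Request(
        f"{base_url}/api/",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def ha_is_up(fetch=fetch_api_status, base_url=HA_URL, token=TOKEN):
    """Return True only if HA answers with a JSON object containing "message"."""
    try:
        body = fetch(base_url, token)
    except (OSError, ValueError, http.client.HTTPException):
        # Covers refused connections, timeouts, HTTP errors, and bad JSON.
        return False
    return isinstance(body, dict) and "message" in body


def should_alert(results, threshold=3):
    """True once the last `threshold` checks have all failed."""
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)


def run_watchdog(check, notify, sleep, threshold=3, interval=60, max_checks=None):
    """Poll `check` every `interval` seconds and call `notify` while HA stays down.

    `notify` is a hypothetical callback; plug in whatever alerting you trust.
    `max_checks=None` means run forever.
    """
    results = []
    count = 0
    while max_checks is None or count < max_checks:
        results.append(check())
        results = results[-threshold:]  # keep only what should_alert needs
        if should_alert(results, threshold):
            notify(f"Home Assistant failed {threshold} checks in a row")
        count += 1
        sleep(interval)
```

To run it for real you would call something like `run_watchdog(ha_is_up, print, time.sleep)` after `import time`, replacing `print` with a real notifier. Requiring several failures in a row avoids alerts for a single dropped request. Note that this only catches HA being unreachable; it would not catch HA answering the API while its data has stopped updating.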

Is it possible that it wasn’t HA that crashed? I’m asking because you mention Aqara sensors. The ones I have are all paired via Zigbee2MQTT, so in my case either the MQTT broker or Zigbee2MQTT could malfunction while HA keeps operating.
So if you use both of these as add-ons and have automatic updates enabled, one of them might have gotten stuck during the update process. I recently had Node-RED stuck in the updating state for about an hour. As I only use it for a single, unimportant task, I just waited until it completed. But since seeing that, I have disabled automatic updates for all add-ons to ensure this won’t happen while I’m away.
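
If the failure is in a bridge such as Zigbee2MQTT rather than in HA itself, a simple "is HA reachable" check would still pass. One way to cover that case is to check for stale data instead. The sketch below, in Python, assumes you have already fetched state objects from HA's REST API (`GET /api/states` returns a list with `entity_id` and `last_updated` fields) and flags the system as frozen when even the most recently updated entity is older than a threshold. The entity names, the date, and the 30-minute threshold are made up for illustration, and the watch list should contain entities that normally update often, since `last_updated` only changes when a state or attribute changes.

```python
from datetime import datetime, timedelta, timezone


def newest_update(states):
    """Return the most recent last_updated timestamp among the state objects."""
    return max(datetime.fromisoformat(s["last_updated"]) for s in states)


def looks_frozen(states, now, max_age):
    """True if no watched entity has updated within max_age (or there is no data)."""
    if not states:
        return True  # no data at all is treated as a failure
    return now - newest_update(states) > max_age


# Illustrative data only: hypothetical entity names and date.
sample_states = [
    {"entity_id": "binary_sensor.bed_motion",
     "last_updated": "2024-01-01T06:03:43+00:00"},
    {"entity_id": "binary_sensor.kids_room_door",
     "last_updated": "2024-01-01T06:04:07+00:00"},
]

checked_at = datetime(2024, 1, 1, 12, 45, tzinfo=timezone.utc)
frozen = looks_frozen(sample_states, checked_at, timedelta(minutes=30))
```

With the sample data, the newest update is 06:04:07 and the check runs at 12:45, so `frozen` is `True`. Run from a separate machine alongside a reachability check, this would flag a freeze like the one described even if HA Core itself were still responding.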

Sure!