Well, I don’t want to push you into something that would cause a catastrophe in your tank, but that’s not been my experience at all. In fact, my Pis / HA instances routinely run for months between reboots.
Just before Xmas, I rebooted the Pi in the attic running HA and its uptime had been 180+ days. But at this minute, this is the longest-running HA / RPi instance:
A few things I did to relieve myself of that worry, though:
1.) Don’t run cutting-edge releases. Unless there’s an applicable security vulnerability, I tend to stay a few versions behind the pack. This gives others a chance to find and iron out the bugs first.
2.) Designed the setup for high availability. In my setup, it’s more important that the HA service stays available than that any one HA instance does.
3.) Abstracted out things like Heaters, Dosers and ATO so that any instance of HA could take over if the controlling HA instance fails. If a non-controlling instance of HA fails, the controlling instance will attempt to recover it - even hard rebooting it - while raising alarms. (There’s a rough sketch of that watchdog idea just below.)
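To give a feel for what I mean by the controlling instance watching the others, here’s a rough sketch of that watchdog idea against the Home Assistant REST API. It’s deliberately simplified, and the hostnames, token and smart-plug entities feeding each Pi are made up for illustration - they’re not my actual config:

```python
# Sketch of the "controlling instance watches the others" idea, using the
# Home Assistant REST API. Hostnames, the token and the smart-plug entities
# that feed each Pi are hypothetical examples.
import time
import requests

TOKEN = "long-lived-access-token"             # hypothetical
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
LOCAL = "http://joshua.local:8123"            # the controlling instance
PEERS = {                                     # peer URL -> smart plug feeding that Pi
    "http://coralpi.local:8123": "switch.coralpi_power",
    "http://nemopi.local:8123": "switch.nemopi_power",
}

def alive(base_url):
    """True if that HA instance answers its own API."""
    try:
        r = requests.get(f"{base_url}/api/", headers=HEADERS, timeout=5)
        return r.status_code == 200
    except requests.RequestException:
        return False

def call_service(domain, service, entity_id):
    """Call a service on the controlling (local) instance."""
    requests.post(f"{LOCAL}/api/services/{domain}/{service}",
                  headers=HEADERS, json={"entity_id": entity_id}, timeout=10)

def notify(message):
    requests.post(f"{LOCAL}/api/services/notify/notify",
                  headers=HEADERS, json={"message": message}, timeout=10)

while True:
    for peer, plug in PEERS.items():
        if not alive(peer):
            notify(f"{peer} is not responding - power-cycling it")
            call_service("switch", "turn_off", plug)   # hard reboot via networked plug
            time.sleep(10)
            call_service("switch", "turn_on", plug)
            time.sleep(180)                            # give it time to boot back up
    time.sleep(60)
```

The point is simply that the check, the alarm and the hard power-cycle all happen from the controlling instance, with nothing on the failed Pi needing to cooperate.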
And it works the other way too. If, say, the fishroom-located “CoralPi” (where the sump, heaters, temp probes, etc. are) were to crash, Joshua - the primary HA controller - would switch to using NemoPi (in the living room, where the 3 display tanks are) for temperature inputs. Since the heater switches are networked, there’s no dependency on CoralPi.
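To illustrate that temp-input switch, here’s a simplified sketch of the idea: prefer the fishroom probes, fall back to the display-tank probes if CoralPi is down or its reading has gone stale. Again, the entity names and URLs are invented for the example, not lifted from my config:

```python
# Sketch of the temperature-input fallback: prefer the fishroom probe on
# CoralPi, fall back to a display-tank probe on NemoPi if CoralPi is down
# or its reading is stale. Entity names, URLs and the token are made up.
from datetime import datetime, timedelta, timezone
import requests

TOKEN = "long-lived-access-token"             # hypothetical
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
SOURCES = [                                    # in order of preference
    ("http://coralpi.local:8123", "sensor.sump_temperature"),
    ("http://nemopi.local:8123", "sensor.display_tank_temperature"),
]
MAX_AGE = timedelta(minutes=5)                 # how stale a reading may be

def read_temperature():
    """Return (value, source entity) from the first healthy, fresh probe."""
    for base_url, entity in SOURCES:
        try:
            r = requests.get(f"{base_url}/api/states/{entity}",
                             headers=HEADERS, timeout=5)
            r.raise_for_status()
            state = r.json()
            updated = datetime.fromisoformat(state["last_updated"])
            fresh = datetime.now(timezone.utc) - updated < MAX_AGE
            if fresh and state["state"] not in ("unknown", "unavailable"):
                return float(state["state"]), entity
        except (requests.RequestException, KeyError, ValueError):
            continue  # that source is down or garbled - try the next one
    raise RuntimeError("No temperature source available - raise an alarm")

print(read_temperature())
```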
Or, if Joshua were to die in a puff of purple electrical smoke, all remaining instances of HA would raise alarms and notifications. NemoPi would take over Dosing and CoralPi would take over ATO and Heater functions, until Joshua reappeared.
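The takeover logic itself is nothing fancy - conceptually it’s just an ordered list of who is allowed to run each function, and the first survivor wins. Something along these lines (the priority order shown is illustrative):

```python
# Sketch of how roles get re-assigned if an instance dies: each function has
# an ordered list of instances allowed to run it, and the highest-priority
# instance that is still alive wins. The priority order here is illustrative.
ROLE_PRIORITY = {
    "dosing":  ["joshua", "nemopi"],
    "ato":     ["joshua", "coralpi"],
    "heaters": ["joshua", "coralpi"],
}

def assign_roles(alive_instances):
    """Map each role to the highest-priority instance that is still up."""
    assignments = {}
    for role, candidates in ROLE_PRIORITY.items():
        assignments[role] = next(
            (name for name in candidates if name in alive_instances), None)
    return assignments

# If Joshua dies, NemoPi picks up dosing and CoralPi picks up ATO + heaters:
print(assign_roles({"nemopi", "coralpi"}))
# -> {'dosing': 'nemopi', 'ato': 'coralpi', 'heaters': 'coralpi'}
```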
I’ve tried to make everything “fail safely by design”. And so far, it’s worked exceptionally well. I’ve had 2 SD card failures, which weren’t big issues because of the above, and because everything gets backed up and there are “cold standby” SD cards taped to the RPis, so a swap can be done even if my wife is home on her own. But those are all about to be upgraded to SSDs soon, and that risk will be gone.
We’ve had a 24-port Cisco switch die at 5am. Joshua woke us up for that. The cold-standby switch got dropped in and I was back in bed within 25 minutes. We’ve had a firewall with blown caps die on us, and everything kept working just fine through the night. (But again, HA told us about that.)
Circuit / GFCI breakers downstairs have tripped three times, and HA told us about it immediately each time (because each HA instance is on its own breaker).
I think with the right architecture planning, you can mitigate a lot, if not all, of your concerns with HA. It’s what I set out to do when I began this - basically replicate the way I’d built GSM networks with 99.97+% service uptime for 15 years.
Oh, last thing… sometimes people ask me about UPSes for all my RPis, but I usually ask “why?” I’ve made the solution resilient enough to work around more than a single RPi failure, whether the cause is a mains power failure or a crashed OS / HA instance. And UPSes only cover mains failures. They won’t cover the case where one of my 5 cats decides to pee on an RPi.
As a risk management professional, I try to plan for the failures we cannot yet imagine. That’s the philosophy I took into this design, and I’d encourage others to do likewise.