Is there a process developed to run 2 parallel instances on separate hardware and be able to fail over from a primary instance to a secondary?
My thought is to have two devices (2ea HA Greens, or 2ea RPI5’s, or 2ea mini PC’s), each running HA. I would update the primary one myself, and the changes would be backed up and also copied to the second computer. If the primary fails, the secondary would start right up and become the primary.
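To make the "if the primary fails, the secondary takes over" step concrete, here is a minimal sketch of the detection part only. It assumes the primary answers on TCP port 8123 (Home Assistant's default web port); `is_reachable`, `FailoverMonitor` and `max_misses` are hypothetical names made up for illustration. It does not sync any state or hand over USB radios, which is where the real difficulty lies.

```python
import socket


def is_reachable(host: str, port: int = 8123, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailoverMonitor:
    """Counts consecutive failed probes and signals when the standby should take over."""

    def __init__(self, max_misses: int = 3) -> None:
        self.max_misses = max_misses  # assumption: 3 misses in a row means "down"
        self.misses = 0

    def record(self, primary_up: bool) -> bool:
        """Record one probe result; return True once the miss threshold is reached."""
        self.misses = 0 if primary_up else self.misses + 1
        return self.misses >= self.max_misses
```

The standby would call something like `monitor.record(is_reachable("primary.local"))` on a timer (the hostname is a placeholder). Requiring several consecutive misses avoids taking over because of a single dropped probe.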
There have been a few threads about high-availability / 2-node clusters, but AFAIK nobody has actually made this work. There’s no functionality in Home Assistant for it, and you would run into issues with anything that uses a controller directly connected to the primary node (Zigbee or Z-Wave in particular).
It would be easier, though still non-trivial, to do this if your mesh networks use Ethernet / WiFi based controllers that are not directly connected to the Home Assistant instance.
In truth, it would be a lot of hassle for very little gain. Home Assistant is in general very reliable, and if you’re worried about hardware failures, well, with two devices you’re just doubling your chances of one.
The better way forward is to take regular backups and, at best, keep spare hardware around, ready to be set up quickly and restored to. I’ve done that before and it’s less than a day to get things up and running again.
As you’re hearing, a Rpi is not up to the task of high availability (which is what you’re asking for).
@fleskefjes has the right idea but I also agree that this:
Is absolutely true.
With good DR procedures and an occasional test of your backup, you can be restored and running on iron or a VM in less than 30 minutes without all the hassle of High Availability. (Yes, I’ve actually done both many times. I keep a spare M.2 with a recent HA build image on it for just such an occasion…)
Do you REALLY need five-nines (99.999%) uptime? (Hint: most corporations don’t either, and it’s very expensive.)
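For a sense of scale, here is a quick back-of-the-envelope calculation of how much downtime per year each availability level allows. The function name is made up for illustration, and a year is taken as 365.25 days.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, pct in [("two nines", 99.0), ("three nines", 99.9), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {downtime_minutes_per_year(pct):.1f} minutes/year")
```

Five nines works out to roughly five minutes of downtime a year, so even a 30 minute restore from backup already falls outside that budget.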
It is really not the Pis that are the issue, but rather that much hardware and many protocols are a one-to-one connection.
A USB device is always one-to-one, so it will always be a single point of failure.
The Ethernet versions of Zigbee/Z-Wave controllers are also often one-to-one and therefore again a single point of failure.
An Ethernet Zigbee/Z-Wave controller moves the device off the HA machine, but two HA instances will often still not be able to stay in sync.
Adding a broker in between, like MQTT, can make it one-to-many, but it also adds another layer that can itself be a single point of failure.
Matter is trying to tackle one of these single points of failure, but Matter is still a thing for the near future.
I have HA running on a machine on my IoT network. I run an isolated subnet that houses only a second machine. I then have just enough ports open to put backups from the first HA machine onto the backup machine. That way the backup system does not try connecting to devices but has a local copy of my latest backup. If I have a hardware failure on the primary machine, I restore the latest backup to the backup machine, swap peripherals and LAN cables, and start running. I would assign IP addresses by LAN port, so minimal settings adjustment is needed.
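The "keep a local copy of my latest backup on the second machine" step could be scripted along these lines. This is only a sketch under assumptions: it treats backups as `.tar` files in one directory and the standby as a mounted path, and `copy_latest_backup` plus any paths you pass in are hypothetical, not anything Home Assistant provides.

```python
import shutil
from pathlib import Path
from typing import Optional


def copy_latest_backup(source_dir: Path, dest_dir: Path) -> Optional[Path]:
    """Copy the most recently modified .tar backup from source_dir to dest_dir.

    Returns the path of the copy, or None if source_dir holds no backups.
    """
    backups = sorted(source_dir.glob("*.tar"), key=lambda p: p.stat().st_mtime)
    if not backups:
        return None
    newest = backups[-1]
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / newest.name
    # Skip the copy if a file with the same name and size is already there.
    if target.exists() and target.stat().st_size == newest.stat().st_size:
        return target
    shutil.copy2(newest, target)  # copy2 also preserves the modification time
    return target
```

Run on a schedule, for example `copy_latest_backup(Path("/backup"), Path("/mnt/standby/ha-backups"))` with both paths adjusted to your setup, this keeps the standby's copy current without the standby ever talking to your devices.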
Like I’ve discussed in a previous thread, the use case is totally valid but you don’t want to do that on appliances like the Green. Get something that can run a hypervisor and replicate and high-availability your heart out. If you want to be hardware independent you need to virtualize.
I use a Proxmox cluster and regularly replicate across 2 nodes. My Z-Wave and Zigbee controllers are both TCP connected. I’ve never had an automated HA failover, but I’ve migrated the VM between nodes in the cluster for host maintenance without a hitch. I would expect the automated migration to go just as well. In my case the host nodes are different architectures (AMD & Intel), so no “live” migration is supported; the VM is simply stopped on one node and started on the other. For Home Assistant (or other VMs I have set up this way, like mosquitto and AdGuard) it has never been a problem.
If you have host nodes of the same architecture and sufficient replication bandwidth or NAS then live migration is supported by Proxmox without missing a beat.