High Availability Clustering

I’ve finally gotten around to writing up my high-availability Home Assistant setup. A reasonably detailed description is available on my wiki, and my config is up on GitHub.


  1. Set up MQTT clustering
  2. Use secrets.yaml to store unique variables
  3. Create a sensor which indicates which instance is active
  4. Enable/Disable automations based on which instance is active
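Steps 3 and 4 boil down to a small decision: which instance is active, and should this instance run its automations? Here is a minimal Python sketch of that logic, not my actual Home Assistant config. It assumes each instance publishes a heartbeat (for example over the clustered MQTT broker) and has a fixed priority; the names `Instance`, `active_instance`, `automations_enabled` and the 30 second timeout are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Instance:
    name: str              # hypothetical per-instance name, e.g. kept in secrets.yaml
    priority: int          # lower number = preferred instance (assumption)
    last_heartbeat: float  # time (seconds) the last heartbeat was seen


def active_instance(instances: Iterable[Instance], now: float,
                    timeout: float = 30.0) -> Optional[str]:
    """Return the name of the preferred instance whose heartbeat is still fresh."""
    alive = [i for i in instances if now - i.last_heartbeat <= timeout]
    if not alive:
        return None
    # Ties on priority are broken by name so every instance reaches the same answer.
    return min(alive, key=lambda i: (i.priority, i.name)).name


def automations_enabled(me: str, instances: Iterable[Instance], now: float,
                        timeout: float = 30.0) -> bool:
    """An instance runs its automations only while it is the active one."""
    return active_instance(instances, now, timeout) == me
```

Every instance evaluates the same function over the same heartbeat data, so they should agree on who is active as long as they see the same heartbeats. A network split between the instances would break that assumption.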

There are probably a lot of edge cases and integrations which this won’t work well with, but I’ve not had any issues with my modest system.

With a little modification this could be made to do load-balancing as well; i.e. half your automations run on one instance, and the other half on the other instance, with complete fail-over if an instance goes offline.
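I haven't built the load-balancing variant, so the following is only a rough Python sketch of one way it might work. It assumes each automation has a stable ID and that every instance knows which instances are currently alive; the function name `owner` and the use of a CRC32 hash to split automations are my own illustrative choices. The split is roughly even rather than exactly half and half.

```python
import zlib
from typing import List, Optional, Set


def owner(automation_id: str, instances: List[str], alive: Set[str]) -> Optional[str]:
    """Pick the instance that should run an automation (hypothetical scheme)."""
    ordered = sorted(instances)
    if not ordered:
        return None
    # A stable hash, so every instance computes the same assignment.
    key = zlib.crc32(automation_id.encode("utf-8"))
    preferred = ordered[key % len(ordered)]
    if preferred in alive:
        return preferred
    # Fail-over: the preferred instance is offline, so hand the automation
    # to one of the surviving instances.
    survivors = [name for name in ordered if name in alive]
    if not survivors:
        return None
    return survivors[key % len(survivors)]
```

Each instance would enable an automation only when `owner(...)` returns its own name. With two instances and one offline, every automation ends up on the survivor, which gives the complete fail-over described above.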

Interested to hear how other people have gone about doing this, or if you have any suggestions.



Great write up!

However, I think that this, as you even wrote yourself, is not applicable to a lot of the setups here.
Many use ZigBee or Z-Wave networks and other external devices.
Another issue would be with addons like Node-RED.

But I like the progress and ideas you and others come up with!
Definitely worth looking into as a Home Assistant outage can be pretty bad if your devices are only controllable through Home Assistant.

If that’s the case then that system is poorly designed. At a minimum every “critical” device (lights, outlets, HVAC) needs to be able to still be manually controlled if your HA is down.

Indeed. That is the case for me. I can control everything even when HA is down.

G’day Cludch,

Provided you have some method of replicating state data read from ZigBee/Z-Wave across to the other instance, and similarly injecting commands from each instance back into the ZigBee/Z-Wave network, then this shouldn’t be insurmountable.

Additionally, Node-RED should be able to use a similar arrangement of tracking the state of a sensor to know whether its automations should fire or not. It would require adding this condition to each Node-RED script though, which could be tedious. Mind you, to get this far you’ve already had to create a whole separate instance of HA, so maybe not such a big deal.

A less involved, but also less responsive, method could be to start/stop HA. This avoids needing to suppress automations on the standby instance, but does have significant hand-over latency.
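To make the latency trade-off concrete, here is a rough Python sketch of the decision a watchdog on the standby machine might make. It is purely illustrative: the function name `standby_action`, the idea of periodic health checks against the primary, and the threshold of three consecutive failures are all assumptions, and actually starting or stopping HA is left to whatever the caller uses.

```python
from typing import Sequence


def standby_action(primary_checks: Sequence[bool], standby_running: bool,
                   failures_needed: int = 3) -> str:
    """Return 'start', 'stop' or 'none' for the standby HA instance.

    primary_checks holds the results of periodic health checks against the
    primary, oldest first (True = primary answered).
    """
    recent = list(primary_checks)[-failures_needed:]
    # Only treat the primary as down after enough consecutive failures,
    # so that a single missed check does not trigger a hand-over.
    primary_down = len(recent) == failures_needed and not any(recent)
    primary_up = bool(recent) and recent[-1]
    if primary_down and not standby_running:
        return "start"
    if primary_up and standby_running:
        return "stop"
    return "none"
```

The hand-over latency here is roughly the check interval multiplied by `failures_needed`, plus however long HA takes to start on the standby.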