High Availability (HA)

I've been picking away at this a little myself and you're right, it entirely depends on what you're running. Towards that, I've tried to limit the number of communication protocols used. I am working really hard to keep my primary protocol to MQTT because it can be completely independent of the HA controller. Also, even though MQTT doesn't natively support load balancing or failover, some folks have developed a basic MQTT broker that does.
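
To give a sense of how little HA itself needs to know about this, pointing Home Assistant at an external broker is just the mqtt block in configuration.yaml. The address below is a placeholder; it could just as well be a virtual IP sitting in front of a redundant broker pair:

```yaml
# configuration.yaml - broker address is a placeholder;
# it could be a virtual IP that fails over between two brokers
mqtt:
  broker: 192.168.1.50
  port: 1883
  username: !secret mqtt_user
  password: !secret mqtt_password
```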


From there, it is entirely possible to have two instances of HA running on two different Pis as long as both share an external recorder database (in my case, MySQL) and one of the two has all of its automations turned off.
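
On the recorder side, that just means both instances point at the same external database, roughly like this (host and credentials are placeholders for whatever your MySQL server uses):

```yaml
# configuration.yaml on both instances - host and credentials are placeholders
recorder:
  db_url: mysql://hauser:hapassword@192.168.1.60/homeassistant?charset=utf8
```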

I have been running two HAs for a few months so I can see what kind of issues arise and I'm happy to say that very few issues have come up as long as automations are off on the secondary instance.

I haven't tested it yet, but HAProxy should work for handling the URL routing.

My bigger issue is with Z-Wave. I'm thinking an external gateway, also run through HAProxy, would be the most practical solution, but that is just speculative at this point.

That link carries a lot of value-added input for me, AndrewHoover, thanks!

Question though… does your MySQL DB have HA/failover?

Well, currently it only does the monitoring part of an HA setup, but it's a start: https://github.com/nragon/keeper

No, not yet. Honestly, I haven't gotten that far yet.
I'm glad the link was useful. That project introduced me to HAProxy, which was a very useful find.

Very nice and simple. Can it be used to talk to other devices in the same way over MQTT?

Currently, the main focus is HA. Yes, the mechanism could be applied to other services, but they would have to implement the heartbeat mechanism to send messages to keeper. Personally, I think this project should, for now, focus on Home Assistant, which is the principal point of failure. That said, this only performs service restarts, but I'm open to suggestions such as backups (or others), which is another piece of high availability.
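
To give an idea of what the heartbeat looks like from the service side, an HA automation along these lines would do it. The topic name and interval here are only illustrative, not keeper's actual defaults:

```yaml
# Illustrative sketch - topic and interval are placeholders, not keeper's defaults
automation:
  - alias: "Publish heartbeat to keeper"
    trigger:
      - platform: time_pattern
        seconds: "/30"          # publish every 30 seconds
    action:
      - service: mqtt.publish
        data:
          topic: "keeper/heartbeat/homeassistant"
          payload: "alive"
```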

nragon, very nice indeed.
Digging into HAProxy, could we have multiple instances of this one… What I mean is that 'entry points' should always respond. I saw some pictures with only one entry point, the load balancer, but what if that craps out?

Historically, load balancers are by default a single device/service and therefore subject to being a SPOF, but ordinarily the load balancer is the only software on the device and is therefore very stable.

Generally, you can't foolproof everything, but if you ensure failover or redundancy on the most complicated pieces, that will greatly reduce risk.

My goal is to have redundant access points, a redundant HA hub, a redundant database, redundant MQTT and backup power for everything. Outside of that, having replacements for the rest (network switches, routers, cables, etc.) and a documented restore plan is about as good as you really need. Ultimately, making sure that your home automation system fails dumb instead of stupid is your overall best possible scenario.

I wrote up my goals and assumptions, as someone designing a system for a person who has a disability (my wife), in this article if anyone is interested.

I think load balancing is not the main issue here. Availability is. Do any of you get 100% uptime on HA?
@anon3881519, Keeper can totally evolve into an active/passive approach deployed across multiple machines. Currently, the downtime is the time needed for HA to start.

So, I was reading up on Docker, and stumbled upon Docker Swarm.
Seems that with its setup of 'manager' and 'worker' nodes, high availability can be achieved.

Nodes can be either physical or virtual.
So with 'Docker on Docker' we could take Raspberry Pis or other hardware that runs Docker and add them to a cluster.
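
As a rough, untested sketch of what that could look like, a stack file along these lines would let swarm restart the container on another node if one dies (the image tag and the shared config path are assumptions):

```yaml
# docker-compose.yml for `docker stack deploy -c docker-compose.yml ha`
# Untested sketch - image tag and config path are assumptions
version: "3.7"
services:
  homeassistant:
    image: homeassistant/home-assistant:latest
    ports:
      - "8123:8123"
    volumes:
      # must be storage every node can reach (NFS, GlusterFS, ...)
      - /mnt/shared/ha-config:/config
    deploy:
      replicas: 1
      restart_policy:
        condition: any
```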

Found a very nice tutorial here:

Official documentation:
https://docs.docker.com/get-started/part4/

Found out about swarms when I stumbled upon this: (4 Pis, 2 managers and 2 workers)

Cheers.

These are some great resources. Thanks for the links!

I wonder how "swarm aware" HA would need to be to pull this off effectively.

Read on here; it seems that it'll work. User quasar66 has a pretty decent setup running!

There's been a fair bit of work recently on high availability configurations of Kubernetes, and also on using Raspberry Pi clusters for Kubernetes. Assuming you have a high availability Kubernetes backplane (i.e. multi-master), storage and multiple cluster nodes, high availability for Home Assistant (excluding Z-Wave) should be possible through:

  1. Configuring HA to run in a Kubernetes pod or pods (rough sketch below)
  2. Deploying a single instance of HA (i.e. your configured pod(s))
  3. Allowing Kubernetes to detect failure and restart the failed pod(s) on another cluster node

The Kubernetes backplane takes care of monitoring, failover and routing. Failover probably wouldn't be instant, but tuning could make it acceptably fast.

k3s (https://k3s.io) looks particularly promising, but doesn't have a high availability config yet.
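
For step 1, a bare-bones Deployment along these lines would be a starting point (image tag, claim name and storage backend are assumptions, not a tested config):

```yaml
# Rough sketch of step 1 - image tag, PVC name and storage backend are assumptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: home-assistant
spec:
  replicas: 1                     # single instance; Kubernetes reschedules it on node failure
  selector:
    matchLabels:
      app: home-assistant
  template:
    metadata:
      labels:
        app: home-assistant
    spec:
      containers:
        - name: home-assistant
          image: homeassistant/home-assistant:latest
          ports:
            - containerPort: 8123
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          persistentVolumeClaim:
            claimName: home-assistant-config   # must be backed by storage all nodes can reach
```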

I run Home Assistant in a Kubernetes cluster using k3s. Even in a non-HA cluster you can survive node outages if the master stays alive. If the master is broken, your workers will just continue with their work. A master node can also act as a worker node at the same time. This means in a two-node setup you will survive the outage of any one node. With a three-node setup you might survive two simultaneous node outages.

K3s will provide an HA mode in the near future.

Would be interested in more detail, if you have time to write.

Hello everyone :hugs:,

I have recently created a project called HAHA - Highly Available Home Assistant, which creates a highly available cluster that runs Home Assistant.

It runs using Docker Swarm and includes a MariaDB Galera cluster and a Mosquitto MQTT broker. It uses GlusterFS to synchronize Home Assistant's files, so that in case of a failure, all of the state is transferred to other nodes.

More details about the project in this thread: HAHA - Highly Available Home Assistant

Hi, would you like to share your config? I tried to use the Helm config but can only get it to work with hostNetwork=true on k3s.
Are you running it with hostNetwork?

I would like to not use hostNetwork, and instead use MetalLB as a load balancer for a fixed IP address to connect inwards and Traefik as a reverse proxy. Anyone got that working and able to share his/her deployment manifest?

Edit: ok, like always, after posting I find my errors… Seems to work now, although not yet with runAsUser, which I would like. If anyone is interested I can post the deployment I currently use.

Edit 2: saving to the Lovelace UI fails. Needs some work…

Regards

Yes, I use the host network because otherwise some plugins which use UPnP don't work.
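
In the pod spec that just means turning on the host network; trimmed down to the relevant part, it looks roughly like this (the rest of the Deployment is omitted):

```yaml
# Trimmed-down sketch - only the relevant part of the pod template
spec:
  template:
    spec:
      hostNetwork: true                        # lets UPnP/multicast discovery see the LAN
      dnsPolicy: ClusterFirstWithHostNet       # keep cluster DNS working with host networking
      containers:
        - name: home-assistant
          image: homeassistant/home-assistant:latest
          ports:
            - containerPort: 8123
```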

Actually, such a problem could be avoided altogether. Think of MS SQL Always-On. This is set up as a cluster. Each individual device has an assigned IP, then there is a "shared" IP for the cluster (only one node would actually be active at a time). In the case of MS SQL, for write functions this is true, and only one of the instances is in "write" mode, while the other is strictly in read mode. For read operations, it wouldn't really matter which node is being looked at.

In this way, delays can be minimized. This would also offer redundancy, in that there are actually two complete copies of the DB. The biggest thing to work out is syncing data across both instances. This could be applied to services as well.

I have had an instance of HA fail for no apparent reason (running on an rPi). I re-flashed the SD card, restored my config, and everything is up and running again. If I had a cluster though, there would have been no outage, and perhaps even a restore could be achieved faster, since a clustered setup would have to sync the configurations.

Just a thought. A native HA "cluster"/"load-balanced" config would be very cool.
This could be processor intensive, so depending on your load it might require more than an rPi3, maybe an rPi4 minimum.

Just my two cents.

Come to think of it, clustering in Linux already exists. If someone could sync the HA DB and the configurations (rsync), then this should be possible running HA under Linux on a couple of small machines.

+1 here. I would also like to see this natively supported.
