HAHA - Highly Available Home Assistant

Hi everyone :relaxed:,

I wanted to find a solution for running Home Assistant with high availability: a backup failover in case something breaks, like when the SD card in a Raspberry Pi decides to die.

After searching, I quickly found out that no ready-made solution to this problem exists yet; the partial solutions that do exist either do not handle state transfer or are complicated to set up.

With inspiration from user quasar66 and his setup described here, I went on to attempt to make a simple-to-run solution for creating a redundant cluster running Home Assistant.

Well, after a lot of trial and error, I managed to create a project called HAHA (which stands for Highly Available Home Assistant), and today I would like to share it with you.

Its features are:

  • Runs on Docker Swarm
  • Easy setup using Ansible playbooks
  • Preconfigured MariaDB Galera Cluster for the recorder component (thanks to Colin Mollenhour)
  • Included Mosquitto broker
  • Uses GlusterFS for synchronizing Home Assistant logs and files, and Mosquitto retained messages

It is made to be run on three or more Raspberry Pi devices, but can technically run on just two. More details can be found in the GitHub repo. Please try it out, and let me know in the issues if you encounter any problems.

Link to the GitHub repository: https://github.com/cvb941/HAHA
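
For readers new to Docker Swarm, the failover pattern the cluster relies on looks roughly like this generic single-service sketch (this is not HAHA's actual stack file, and the Gluster mount path is just an example):

version: "3.7"
services:
  homeassistant:
    image: homeassistant/home-assistant:stable
    volumes:
      - /mnt/gluster/ha:/config    # path shared to every node via GlusterFS
    ports:
      - "8123:8123"
    deploy:
      replicas: 1                  # only one instance runs at a time
      restart_policy:
        condition: any             # Swarm reschedules it if a node dies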

35 Likes

Nice! I will take a look.

To have highly available Z-Wave using Gen 5 Aeotec sticks, you can do this:

For sticks plugged into a PC that's in the Docker cluster:

  • back up the stick after everything is done and it works
  • restore the backup to another 1/2/x sticks
  • plug the sticks into different nodes
  • label the nodes that have the sticks with has_zwave=1 or something
  • make sure Home Assistant starts on those nodes

For “remote” sticks (socat/ser2net, USB-over-IP) it does not matter on which node Home Assistant runs.

If you want “true” high availability for Z-Wave:

  • one primary (master) stick
  • one secondary stick
  • one dedicated Home Assistant instance for each, integrated with a single MQTT broker
  • the main Home Assistant should send commands to that MQTT broker, and the primary (master) instance will read and run the commands (see the sketch below)
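
A minimal sketch of what that relay could look like on each dedicated Z-Wave instance; the topic name, the payload layout, and the (old-style) service_template syntax are my assumptions, not part of the setup above:

automation:
  # Illustrative only: run whatever command the main instance publishes.
  - alias: Run relayed Z-Wave commands
    trigger:
      - platform: mqtt
        topic: ha/zwave/command    # assumed topic
    action:
      - service_template: "{{ trigger.payload_json.service }}"
        data_template:
          entity_id: "{{ trigger.payload_json.entity_id }}"

The main Home Assistant would then publish something like {"service": "switch.turn_on", "entity_id": "switch.kitchen"} to that topic.
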
1 Like

Interesting - thanks! I’ve been wondering about this since HA started to be more reliable than the hardware I was running it on :slight_smile:

I also hadn’t come across Gluster, which could be useful in other contexts.

Nice! How does it handle automations, specifically preventing triggering the same thing on all nodes?

Only one instance of Home Assistant is running at a time. If it goes down, it starts on a different host.

Another approach to consider is running Home Assistant on Kubernetes (there is already a Helm chart for it) and letting Kubernetes handle the scheduling, as this solution aims to do.
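
For context, a generic sketch (not the Helm chart itself) of how Kubernetes expresses that: a single-replica Deployment that gets rescheduled onto a healthy node, with the config on shared storage. The PVC name here is an assumption:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: home-assistant
spec:
  replicas: 1                    # only one instance at a time
  strategy:
    type: Recreate               # never run two copies during a rollout
  selector:
    matchLabels:
      app: home-assistant
  template:
    metadata:
      labels:
        app: home-assistant
    spec:
      containers:
        - name: home-assistant
          image: homeassistant/home-assistant:stable
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          persistentVolumeClaim:
            claimName: ha-config   # must live on shared storage for failover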

1 Like

It seems this is the freshest thread on this redundancy aspect. I’ve come to share my take on the matter, which seems to work quite well.

What I’m doing:

  1. I have configured everything on my HA so that it can be copied straight to another server/instance; once the main HA server fails, an automation will activate all alerts, other automations, etc. on the backup HA server
  2. Three times a week I transfer the whole main HA to the backup HA and restore the latest snapshot, so that the backup HA stays up to date automatically. That way I won’t need to update the same things twice…

How I’m doing it, several components involved:

  • Samba share add-on to easily open a route for file transfer to the (backup) server.

  • Samba backup add-on to back up everything to the backup HA instance

  • Command line sensors to check which servers are online:

  - platform: command_line
    name: main
    command: curl -m 3 -s http://main.ip.add.ress:8123/ > /dev/null && echo ON || echo OFF
    scan_interval: 60

  - platform: command_line
    name: backup
    command: curl -m 3 -s http://backup.ip.add.ress:8123/ > /dev/null && echo ON || echo OFF
    scan_interval: 60
  • local_ip: in configuration.yaml to create a sensor that lets automations detect which instance HA is currently running on

  • An automation built from the local_ip sensor together with the main server sensor, so that automations and other unwanted overlapping processes are turned off on the backup server whenever the main server has been on (for 2 minutes)

  • Another automation, vice versa, to turn everything on on the backup server if the main server has been off for 60 minutes.

  • And one automation for restoring the up-to-date snapshot on the backup server three times a week with a shell command

  • A shell command that parses the latest snapshot from ha snapshots list (I hope so; I’m not sure this works yet, as the list command seems to return them in random order). On my first test the “ha snapshots list” command listed the latest snapshot first, so this command takes the first “slug” in the list and restores it:

shell_command:
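  # Note: the first pipe below discards $variable before the restore runs (see the edit that follows)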
  restore_latest: variable=`ha snapshots list | grep 'slug'|cut -f 2 -d ":"|head -1` | ha snapshots restore $variable

EDIT 26.11.2020: The shell_command is not working this way. It does not accept piped commands, if I understood the documentation correctly. I have tried the following without success so far:

  • Created restore.sh with the same command and ran it with shell_command “bash restore.sh”.
  • Tried to create a command_line sensor to grep the latest snapshot slug for use in an automation’s data, which would launch the HA snapshot restore service. I’m unable to get this to work either.
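
For reference, the stray pipe between the backtick assignment and the restore call discards $variable even in a plain shell, independent of shell_command’s restrictions. A command-substitution form would avoid that particular bug (untested sketch; the snapshot-ordering caveat above still applies):

shell_command:
  # Untested sketch: command substitution instead of the discarded variable;
  # "head -1" still assumes the newest snapshot is listed first.
  restore_latest: ha snapshots restore $(ha snapshots list | grep 'slug' | cut -f 2 -d ':' | head -1)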

So what is important to understand: I’m running exactly the same copy of HA on two different instances, but with different static IP addresses. This way I’m able to build automations that turn services on/off depending on which server HA is running on, which server is online, and so on.
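
As an illustration of that pattern, a minimal sketch of the “backup takes over after 60 minutes” automation; the entity names (sensor.main from the command_line sensor above, sensor.local_ip from the local_ip integration), the backup server’s IP, and the group of failover automations are my assumptions:

automation:
  - alias: Backup server takes over
    trigger:
      - platform: state
        entity_id: sensor.main                  # command_line sensor from above
        to: "OFF"
        for: "01:00:00"                         # main offline for 60 minutes
    condition:
      - condition: state
        entity_id: sensor.local_ip
        state: "192.168.1.3"                    # assumed static IP of the backup box
    action:
      - service: automation.turn_on
        entity_id: group.failover_automations   # hypothetical group of automations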

With this methodology one can build quite a complicated system. I’m in the early stages of testing. Fingers crossed!

Edit 27.11.2020: I’ve now given up on the automatic backup restoration on the backup server. Reasoning in this post: Update notifications! Core, HACS, Supervisor and Addons

3 Likes

I just read the GitHub readme… it is such a great project. I was looking forward to using it, but you mention this isn’t for x86 hardware. Maybe I can adapt it. Thanks for sharing! :vulcan_salute:

1 Like

While this is an older thread, I do have a high availability solution that works on x86_64 as well as ARM64/AArch64. Note that at this time you will have to compile a few packages from source when using ARM64/AArch64.

Like many of you out there, I have recently found myself more and more reliant on Home Assistant. After looking into Home Assistant high availability I found there aren’t really any options supported by Home Assistant.

So I went ahead and built a solution using existing open source high availability components such as DRBD and Pacemaker. Home Assistant can be made highly available fairly easily using existing tools that have been around for well over a decade. These same tools are used in countless enterprise-grade production deployments. Full disclosure: I am also employed by LINBIT, but I’m not here to sell you anything. These tools are just as free to use as Home Assistant and are part of the open source ecosystem. The required software components can easily be installed through our PPA.

I am personally a big fan of running Home Assistant as a Docker container. My blog post walks you through the process of making Home Assistant highly available using containers. With this solution, Home Assistant’s data is mirrored over the network in real-time (by DRBD, think network-based RAID-1) while Pacemaker controls which node is currently active (replicated filesystem mounted, virtual IP active, and Docker instance(s) running, etc). Failover to a secondary node takes mere seconds once the primary goes down.
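
To give a flavor of the Pacemaker side, here is a minimal sketch using pcs. The resource names, the DRBD resource “ha0”, the mount point, the virtual IP, and the container options are placeholders of mine rather than the blog post’s exact configuration (and on older Pacemaker releases the promoted role is spelled “Master”):

# Sketch only; assumes DRBD resource "ha0" is already configured on both nodes.
# Promotable DRBD resource; Pacemaker chooses which node is Primary
pcs resource create drbd_ha ocf:linbit:drbd drbd_resource=ha0 \
    promotable promoted-max=1 promoted-node-max=1 clone-max=2 notify=true

# Filesystem, floating IP, and the Home Assistant container
pcs resource create ha_fs ocf:heartbeat:Filesystem \
    device=/dev/drbd0 directory=/mnt/ha-config fstype=ext4
pcs resource create ha_vip ocf:heartbeat:IPaddr2 ip=192.168.1.200 cidr_netmask=24
pcs resource create ha_docker ocf:heartbeat:docker \
    image=homeassistant/home-assistant:stable \
    run_opts="--net=host -v /mnt/ha-config:/config"

# Group them, keep the group on the DRBD Primary, and start it in order
pcs resource group add ha_group ha_fs ha_vip ha_docker
pcs constraint colocation add ha_group with drbd_ha-clone INFINITY with-rsc-role=Promoted
pcs constraint order promote drbd_ha-clone then start ha_group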

I hope to make this topic into an ongoing series with more and more content.

6 Likes

@ryan-ronnander, before I read your website: does this really work if you have antennas, e.g. ZigBee? I have lots of devices on ZigBee and HomeMatic.

While I don’t personally use ZigBee and have a very reliable WiFi network, it should work. Conceptually it would function similarly to how a virtual IP address is passed around in a high availability cluster depending upon which host is currently “active”.

Hi gentlemen, very nice and interesting topic on HAHA. My question is the following: do you think HAHA could be done using the HA operating system on different physical machines? From the reading I understood that somebody did it on 3 Raspberry Pis, but I did not find many details. Actually I’m running HA on an x86 machine (a 2nd-hand thin client computer).
Any feedback will be welcome

I haven’t learned Kubernetes yet, but am working with Proxmox to build a high availability cluster with CEPH storage. Proxmox claims to be able to perform seamless VM and LXC failover when a cluster node fails. CEPH is supposed to be resilient against media and controller failures, and is supposed to work even if the components are geographically distributed.

My fantasy is that I set up Proxmox+CEPH and create a VM running HASSIO/Home Assistant with live failover to other cluster nodes; this would create an inherently HA Home Assistant system. Could it be this simple?
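
As far as I can tell, the Proxmox side of that fantasy would be only a few commands once the cluster and the CEPH pool exist; this is a sketch, with the VM ID (100) and the group/node names as placeholders:

# Sketch only: assumes a 3-node Proxmox cluster with a CEPH-backed storage
# pool already configured, and the Home Assistant VM created as VM 100.
ha-manager groupadd ha-nodes --nodes "pve1,pve2,pve3"    # preferred failover nodes
ha-manager add vm:100 --state started --group ha-nodes   # put the VM under HA control
ha-manager status                                        # verify the resource state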

You can tell us when you have done it.

1 Like

So, should I color you as skeptical? I appreciate your applying Yogi Berra’s “It ain’t over until it’s over.” wisdom.

Several approaches described in this thread seem to rely on some HA-specific customization. I’m hoping to avoid this. The Kubernetes suggestion seems to imply that it could provide a transparent/automatic solution requiring no HA customization, but as with my Proxmox/CEPH idea, it’s just an idea. Since I have no experience with Kubernetes I have no idea where wrinkles breaking transparency might be.

Although I’m an HA adopter from way back, I’ve only been digging into Proxmox for a few months. So far, given that I’m using a purely IP-based, hardwired setup (i.e. no Zigbee/Zwave/Thread bridge failover needs), the Proxmox+CEPH approach to application-agnostic high availability clusters seems to check all the boxes. But it’s not proven to work for Home Assistant.

As you implied, things are seldom as simple as they seem. With all the thought the folks on this thread have put into the subject this seems the natural place to ask “Could it really be this simple?”.

Not skeptical, but I can’t test it. You seem to have the knowledge. I have never used ceph and don’t really know what it is capable of. So give it a go and educate us.

Fair enough. I’ve got just enough knowledge of both to be dangerous, and am too ignorant to know it can’t be done. If someone who understands them better (i.e. 99% of you) can point out possible potholes, I’d love to know where they might lie.

Wireless bridges would be one example, but since I’m ditching those anyway, at least that is eliminated in my case. Another might be that CEPH requires at least three storage nodes in the cluster to even start up, since its entire purpose is to provide a cluster with fault tolerant mass storage. Three servers just to provide a single HAHA service might seem expensive, unless some nodes are RPIs (which has actually been done).

One that just occurred to me, and could be very application-specific (e.g. to HA), is whether the granularity of Proxmox’s live migration matches how HA updates the state of the system. Somehow I doubt that Proxmox provides instruction-level granularity when doing migration. CEPH also has some latency when replicating writes because of storage and networking latencies. A mismatch here could show up as intermittent and very hard-to-replicate failures. In this case, even a “try it out and see how it works” approach is just a Monte Carlo test. An opinion based on understanding how HA handles state updates might be crucial. If changes to the HA Core are needed to make it work, then my idea is a bad one.

1 Like