Home Assistant running on a high availability - fault tolerant mini cluster made of two Raspberry Pi (Pacemaker, Corosync, PCS, NIC bonding)

@fdlou147 unfortunately I have no plans to support any USB device. My board only has support for ZigBee at this time.

That’s fine, but it’s still an amazing write-up and I’m going to follow it. I’m going to use your idea and hope to get a second NIC teamed on my installation.

Hi,
Would your setup allow redundancy of the full system, so that if one Pi fails the other takes over?
That would include any app running on the Pis, not only HA.
I’m looking at how to make my network monitoring Pi redundant.
thx
Stefan

As long as your app runs as a Linux service, it can be managed by Pacemaker and made redundant across two Pis, so that when one goes down it fails over to the other one.
If you have two Pis you can try it.
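
A minimal sketch of what that could look like with pcs, assuming the app already runs as a systemd unit. The unit name myapp, the resource names app and cluster_ip, and the address 192.168.1.50 are placeholders and not values from this thread. By default the script only prints the commands so they can be reviewed first; set DRY_RUN=0 on a cluster node to actually run them.

```shell
#!/bin/sh
# Sketch only: unit name, resource names and IP are hypothetical placeholders.
# With DRY_RUN=1 (the default) the commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"
APP_UNIT="${APP_UNIT:-myapp}"              # hypothetical systemd unit (myapp.service)
CLUSTER_IP="${CLUSTER_IP:-192.168.1.50}"   # hypothetical virtual IP

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_app_failover() {
    # Floating IP that clients connect to.
    run pcs resource create cluster_ip ocf:heartbeat:IPaddr2 \
        ip="$CLUSTER_IP" cidr_netmask=24 op monitor interval=30s
    # The application itself, managed through its systemd unit.
    run pcs resource create app "systemd:$APP_UNIT" op monitor interval=30s
    # Keep the app on the node that holds the IP, and start the IP first.
    run pcs constraint colocation add app with cluster_ip
    run pcs constraint order cluster_ip then app
}

setup_app_failover
```

The colocation and order constraints are what make the pair move together on failover. Exact pcs syntax can differ slightly between versions, so check `pcs resource create --help` on your nodes.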

I do have two Pi 4s.
Each Pi is running 64-bit.
Apps are mostly deployed as Docker containers.

Is there a tutorial showing how to set up redundancy?
Or could you share some bullet points on what you did?

At the top of this thread there is a sort of guide I made some time ago; you can refer to that and try to adapt it to your application.
Unfortunately I have no experience with Docker.

Thx, that helps me, so it is only Pacemaker and a virtual / cluster IP.
It could become an even more interesting concept by integrating shared network storage, so data is always in sync: HA Cluster with Linux Containers based on Heartbeat, Pacemaker, DRBD and LXC - Thomas-Krenn-Wiki plus help for docker compose: Creating Highly Available Nodes on ICON — Stage 2: HA Cluster with Nginx and P-Rep Node | by 2infiniti (Justin Hsiao) | Medium

I have to get myself some spare Raspberry Pis and a free weekend to test it :slight_smile:

some resources explaining the same as you did from another perspective:

Very interested in this bit: what is the service you are using to sync the HA instances over the network? I see several references to an additional resource to synchronize the HA profiles, but what is that service? I would very much like to set this up, but I would need to know what service is used to sync the 2 HA instances. I already have 2 boxes with dual NICs, and the guide was extremely helpful for setting up the virtual IP that the clients point to, so you aren’t even aware of which HA instance you are currently connected to. I imagine this would work the same with a 3-node cluster, although as you stated that might be overkill.

Hello James, thanks for your interest in my system.
There are a few ways to sync folders across two (or more) nodes; the one I use in my system is probably the simplest, and it’s based on rsync and inotifywait.

You need to install inotifywait

sudo apt install inotify-tools

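create a bash script hasync.sh

#!/bin/bash
# Block until something changes under /home/homeassistant, then push the whole tree to the other node.
# exclude.txt and rsync_pass are looked up relative to the directory the script is started from.
while inotifywait -q -q -r -e modify,create,delete,move /home/homeassistant; do
    # Replace OTHER_NODE_IP with the address of the other node; rsyncclient is the user from "auth users" in rsyncd.conf.
    rsync -azrhq --log-file=/dev/null --delete --delete-excluded --exclude-from exclude.txt --password-file=rsync_pass /home/homeassistant/ rsync://rsyncclient@OTHER_NODE_IP/homeassistant
done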

and a file rsync_pass with this content (put a stronger password!)

123456

configure /etc/rsyncd.conf (example below)

lock file = /var/run/rsync.lock
#log file = /var/log/rsyncd.log
log file = /dev/null
pid file = /var/run/rsyncd.pid

[homeassistant]
path = /home/homeassistant
comment = Homeassistant config directory
uid = homeassistant
gid = homeassistant
read only = no
list = yes
auth users = rsyncclient
secrets file = /etc/rsyncd.secrets
hosts allow = 192.168.1.0/255.255.255.0

create /etc/rsyncd.secrets

rsyncclient:123456

sudo chmod 600 /etc/rsyncd.secrets

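The secrets file must be accessible only by root. With the default "strict modes" setting, the rsync daemon rejects a secrets file that other users can read.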
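
Since 123456 is only a stand-in, here is a small sketch for generating a random password and writing both credential files with owner-only permissions. OUT_DIR is a hypothetical staging directory introduced for this example; on a real node the files would be copied to /etc/rsyncd.secrets on the receiving side and to rsync_pass next to hasync.sh on the sending side.

```shell
#!/bin/sh
# Sketch: create matching rsync credential files with a random password.
# OUT_DIR is a hypothetical staging directory, not a path from the guide.
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"
RSYNC_USER="rsyncclient"   # must match "auth users" in rsyncd.conf

umask 077  # files created from here on are accessible by the owner only

# 24 random bytes, base64-encoded, with '/', '+', '=' and newlines removed
# so the result is safe to paste into config files.
PASSWORD="$(head -c 24 /dev/urandom | base64 | tr -d '/+=\n')"

# Daemon side: one "user:password" line.
printf '%s:%s\n' "$RSYNC_USER" "$PASSWORD" > "$OUT_DIR/rsyncd.secrets"
# Client side: the password only.
printf '%s\n' "$PASSWORD" > "$OUT_DIR/rsync_pass"
chmod 600 "$OUT_DIR/rsyncd.secrets" "$OUT_DIR/rsync_pass"

echo "wrote credentials to $OUT_DIR"
```

Both nodes need the same password, because each node acts as the rsync client when it is active and as the rsync daemon when it is on standby.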

I also create an rsync exclude.txt file listing the folders/files that I don’t want to replicate

.cache
zigbee.db-journal

Then you need to create a systemd service in order to have the sync process running

hasync.service

[Unit]
Description=Home Assistant profile sync utility
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=root
WorkingDirectory=/home/pi
ExecStart=/bin/bash hasync.sh

[Install]
WantedBy=multi-user.target

You need to create two configs, one on node A and one on node B
The one on node A must sync to node B
The one on node B must sync to node A

Only one sync process must be active at a given time and only on the currently active node.
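
One possible way to enforce that rule is to hand hasync.service over to Pacemaker and tie it to the virtual IP. The sketch below is an assumption about how this could be wired up, not necessarily how the cluster described in this thread does it; ha_sync and cluster_ip are hypothetical resource names. It prints the commands by default; set DRY_RUN=0 to run them.

```shell
#!/bin/sh
# Sketch: let Pacemaker run hasync.service only on the node holding the virtual IP.
# "ha_sync" and "cluster_ip" are hypothetical resource names.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

register_sync_resource() {
    # systemd must not start the unit on its own at boot; Pacemaker owns it.
    # (This line has to be run on both nodes.)
    run systemctl disable --now hasync.service
    # Manage the sync script through its systemd unit.
    run pcs resource create ha_sync systemd:hasync op monitor interval=30s
    # Run the sync only where the virtual IP is...
    run pcs constraint colocation add ha_sync with cluster_ip
    # ...and only after that IP is up.
    run pcs constraint order cluster_ip then ha_sync
}

register_sync_resource
```

With this arrangement a failover stops the sync on the old active node and starts it on the new one, so the copy direction always follows the active node.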

I understand this is not a detailed guide, but it may give you a starting point.

I finally had time to create a personal blog where I will cover all aspects of home assistant running in a cluster,
including the node synchronization and the hardware I designed.

Thanks so much for the detailed information. I’m trying to set up failover for HA and there are a few different ways, but most involve VMware, Proxmox or Docker, which just complicates things, and VMware is extremely expensive, especially on the networking side of things, plus other factors. So the quick feedback and detailed info is extremely appreciated.

Even then, it also sounds like those methods may involve waiting for the failover VM to spin up, and you lose all recorder history. With this method it appears you lose just the recorder/history that’s currently in memory on the box that goes down, which is also a plus for doing it this way IMO. Thanks again!

I also use the strangest password for every site, Passw0rd. That 0 really throws people off and could never be brute forced :slight_smile:

Hello everyone,

I just wanted to tag on to this thread and mention I’ve recently gone down a similar path just as @Lantastic has. I happen to be employed by LINBIT and we provide enterprise support for Pacemaker and Corosync.

My approach differs a bit due to using DRBD (obviously!) to synchronously replicate the filesystem used for storing Home Assistant’s data in real time between the nodes. Using this solution for mirroring data ensures each node has an exact block-for-block copy of Home Assistant’s filesystem and data. You’ll never lose data that has been written to disk this way during failovers.

I do have some fencing agents for Shelly and Tasmota smart plugs in the works. A very “Home Assistant way” to bring fencing (also called STONITH) to Home Assistant high availability.

HomeassistantOS is not one of the supported ones I presume :sweat_smile:

Haha, yeah there are too many ways to install Home Assistant.

HomeassistantOS seemed overkill and bloated at first glance. For someone already running an always-on server in their home, it didn’t make much sense to me. A few lean Docker containers seemed like a no-brainer way to go (for me). Running Home Assistant, MQTT, and ESPHome containers has been quite sufficient so far.

It looks like you could take the HomeassistantOS qcow2/vmdk/img virtual disk image and convert that to an actual block device (physical disk, logical volume, etc) - qemu-img dd -f qcow2 -O raw bs=4096 if=./image.qcow2 of=/dev/sdb or similar command.

Pacemaker can then be configured to manage a highly available virtual machine (with libvirt). DRBD will replicate the entire virtual disk between nodes in real time. Similar approach, but everything is contained inside a VM.

@ryan-ronnander thanks for bringing your contribution to this thread. I’m glad to see that there is some interest in high availability for Home Assistant.
I went through your blog post and found your approach really interesting.
When I started my project, I evaluated DRBD too, but I ended up with the solution explained above in this thread for a few reasons:

  1. Being a hobby project, I wanted to keep it super simple and super cheap, it is actually made of two Raspberry PIs running on SD storage.
  2. I wanted file sync to be one-way only, from the active node to the standby node
  3. Home Assistant upgrades: during my testing, I found out that when upgrading the stand-by node, file replication must be disabled in order to avoid a sort of race condition between the active node and the node being upgraded.

This is how I upgrade my system; say node A is the currently active one:

  1. put node B on standby, this disables file replication
  2. upgrade Home Assistant on node B
  3. unstandby node B
  4. manual failover, node B becomes the active one, wait until Home Assistant finishes starting up after the upgrade
  5. standby node A, file replication is disabled again
  6. upgrade Home Assistant on node A
  7. unstandby node A
  8. manual failover, node A becomes the active one again

The directory affected by the race condition, if file replication stayed active during a Home Assistant upgrade, is /home/homeassistant/.cache
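
The eight upgrade steps above can be sketched as a dry-run script. This is an illustration under assumptions: the node names nodeA and nodeB, the resource name homeassistant, and the use of `pcs node standby` (older pcs releases use `pcs cluster standby`) are not taken from the original setup. The Home Assistant upgrade itself is left as a manual step, because it depends on how Home Assistant was installed.

```shell
#!/bin/sh
# Sketch of the rolling upgrade: node names and resource name are hypothetical.
# With DRY_RUN=1 (the default) the pcs commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
RESOURCE="${RESOURCE:-homeassistant}"   # hypothetical Pacemaker resource or group

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps that need a human: just print a reminder.
manual() {
    echo "MANUAL: $*"
}

upgrade_node() {
    # $1 = node to upgrade; it must currently be the passive node.
    run pcs node standby "$1"               # stops cluster resources on $1
    manual "upgrade Home Assistant on $1"
    run pcs node unstandby "$1"
    run pcs resource move "$RESOURCE" "$1"  # manual failover to the upgraded node
    manual "wait until Home Assistant has finished starting on $1"
    # Remove the location constraint that 'move' may leave behind;
    # harmless if none is left.
    run pcs resource clear "$RESOURCE"
}

# Node A is active at the start: steps 1-4 handle node B, steps 5-8 handle node A.
upgrade_node nodeB
upgrade_node nodeA
```

Running the function twice, once per node, leaves node A active again at the end, which matches step 8.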