Home Assistant running on a high-availability, fault-tolerant mini cluster made of two Raspberry Pis (Pacemaker, Corosync, PCS, NIC bonding)

My old custom-built home software persisted every change of its finite state machine to the database so that it could be resumed in case of a sudden crash.
I’m not sure whether a similar behaviour could be obtained with Home Assistant.
BTW, at this time I’m not worried about achieving complete redundancy including state sync, because I believe that having a second instance of the application ready to start is better than having a complete crash that would require human intervention and probably some time to restore.

Below is a simple schematic of my idea to make hardwired sensors redundant.
I will only support I2C and UART, as all of my sensors are interfaced with those types of link.

Is there going to be a how-to guide for lesser mortals?


Sure!
I can put together a quick guide with all the steps that I took to implement my system.
Please take into account that this is far from being a production-quality setup.
There is still a lot of work to do, but I’m glad that someone is interested.


Inspiring setup you have there! Love the effort you have gone to so that it is properly HA.

I don’t use Pis myself; I have two small DIY tower PCs with a bit more grunt, as I run other things too (one as a master, one as a backup that is periodically synced to).

I have been planning to configure UCARP on them so that I could quickly flip the IP over should the master die, but never gave any thought to stuff like Corosync!

I believe you could obtain a similar redundant system with your PCs as long as they are running Linux.
If you only need a virtual IP, you might want to have a look at Keepalived.
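Just as a rough sketch (not from my setup; the interface name, password and priorities are placeholders), a minimal keepalived.conf VRRP instance for the master node could look like this, with the backup node using state BACKUP and a lower priority:

vrrp_instance VI_1 {
    state MASTER                 # use BACKUP on the second node
    interface eth0               # placeholder, set your NIC name
    virtual_router_id 51
    priority 150                 # use a lower value (e.g. 100) on the backup
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme       # pick a real secret
    }
    virtual_ipaddress {
        192.168.64.50/24         # the shared virtual IP
    }
}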


As promised, here is a guide on how to set up Pacemaker on a couple of Raspberry Pis and run Home Assistant in a cluster.
Some prerequisites and assumptions:

English is not my first language.

This is a rough guide based on my notes.
I made many attempts and my progress was based on trial and error, so please take this into account.
I will soon make a clean installation from scratch, and that will be a good chance to write a precise guide (even for myself).

At the time of writing, the Python version on the latest Raspberry Pi OS (Buster) needs to be upgraded in order to run Home Assistant Core.
How to do this is out of the scope of this guide; I compiled Python 3.9.2 from source.

All installations have to be done on both nodes.

You need to define 3 IP addresses: one for the VIP (virtual IP) and one for each node. These IPs need to be static.
There must also be DNS or a similar hostname-to-IP resolution.
Since I use OpenWrt on my routers, I defined static host names there.

On my installation I defined the following:

ha-berry.lan 192.168.64.50 (virtual IP)
node-a.lan 192.168.64.51
node-b.lan 192.168.64.52
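
If you cannot define static host names on your router, a simple alternative is to add the entries to /etc/hosts on both nodes:

192.168.64.50   ha-berry.lan ha-berry
192.168.64.51   node-a.lan node-a
192.168.64.52   node-b.lan node-b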

Install the Pacemaker stack

sudo apt-get update
sudo apt-get upgrade

Reboot the Raspberry Pi

sudo apt-get install pacemaker

(this will also install corosync)

Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  cluster-glue corosync docutils-common fence-agents gawk libcfg7 libcib27 libcmap4
  libcorosync-common4 libcpg4 libcrmcluster29 libcrmcommon34 libcrmservice28 libcurl3-gnutls
  libdbus-glib-1-2 libimagequant0 libjbig0 libknet1 liblcms2-2 liblrm2 liblrmd28 libltdl7
  liblzo2-2 libmariadb3 libnet-telnet-perl libnet1 libnspr4 libnss3 libopenhpi3 libopenipmi0
  libpaper-utils libpaper1 libpe-rules26 libpe-status28 libpengine27 libpils2 libplumb2
  libplumbgpl2 libqb0 libquorum5 libsensors-config libsensors5 libsgutils2-2 libsigsegv2
  libsnmp-base libsnmp30 libstatgrab10 libstonith1 libstonithd26 libtiff5 libtimedate-perl
  libtransitioner25 libvotequorum8 libwebp6 libwebpdemux2 libwebpmux3 libxml2-utils libxslt1.1
  mariadb-common mysql-common openhpid pacemaker-cli-utils pacemaker-common
  pacemaker-resource-agents python3-asn1crypto python3-boto3 python3-botocore
  python3-cffi-backend python3-cryptography python3-dateutil python3-docutils python3-fasteners
  python3-googleapi python3-httplib2 python3-jmespath python3-monotonic python3-oauth2client
  python3-olefile python3-openssl python3-pexpect python3-pil python3-ptyprocess python3-pyasn1
  python3-pyasn1-modules python3-pycurl python3-pygments python3-roman python3-rsa
  python3-s3transfer python3-sqlalchemy python3-sqlalchemy-ext python3-suds python3-uritemplate
  resource-agents sg3-utils sgml-base snmp xml-core xsltproc
Suggested packages:
  ipmitool python3-adal python3-azure python3-keystoneauth1 python3-keystoneclient
  python3-novaclient gawk-doc liblcms2-utils lm-sensors snmp-mibs-downloader crmsh | pcs
  python-cryptography-doc python3-cryptography-vectors docutils-doc fonts-linuxlibertine
  | ttf-linux-libertine texlive-lang-french texlive-latex-base texlive-latex-recommended
  python-openssl-doc python3-openssl-dbg python-pexpect-doc python-pil-doc python3-pil-dbg
  libcurl4-gnutls-dev python-pycurl-doc python3-pycurl-dbg python-pygments-doc
  ttf-bitstream-vera python-sqlalchemy-doc python3-psycopg2 python3-mysqldb python3-fdb
  sgml-base-doc debhelper
The following NEW packages will be installed:
  cluster-glue corosync docutils-common fence-agents gawk libcfg7 libcib27 libcmap4
  libcorosync-common4 libcpg4 libcrmcluster29 libcrmcommon34 libcrmservice28 libcurl3-gnutls
  libdbus-glib-1-2 libimagequant0 libjbig0 libknet1 liblcms2-2 liblrm2 liblrmd28 libltdl7
  liblzo2-2 libmariadb3 libnet-telnet-perl libnet1 libnspr4 libnss3 libopenhpi3 libopenipmi0
  libpaper-utils libpaper1 libpe-rules26 libpe-status28 libpengine27 libpils2 libplumb2
  libplumbgpl2 libqb0 libquorum5 libsensors-config libsensors5 libsgutils2-2 libsigsegv2
  libsnmp-base libsnmp30 libstatgrab10 libstonith1 libstonithd26 libtiff5 libtimedate-perl
  libtransitioner25 libvotequorum8 libwebp6 libwebpdemux2 libwebpmux3 libxml2-utils libxslt1.1
  mariadb-common mysql-common openhpid pacemaker pacemaker-cli-utils pacemaker-common
  pacemaker-resource-agents python3-asn1crypto python3-boto3 python3-botocore
  python3-cffi-backend python3-cryptography python3-dateutil python3-docutils python3-fasteners
  python3-googleapi python3-httplib2 python3-jmespath python3-monotonic python3-oauth2client
  python3-olefile python3-openssl python3-pexpect python3-pil python3-ptyprocess python3-pyasn1
  python3-pyasn1-modules python3-pycurl python3-pygments python3-roman python3-rsa
  python3-s3transfer python3-sqlalchemy python3-sqlalchemy-ext python3-suds python3-uritemplate
  resource-agents sg3-utils sgml-base snmp xml-core xsltproc
0 upgraded, 100 newly installed, 0 to remove and 0 not upgraded.
Need to get 21.5 MB of archives.

sudo apt-get install pcs

The user pi needs to be a member of the haclient group on both nodes:

sudo usermod -a -G haclient pi
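
The pcs authentication below is done with the hacluster user, so set a password for it on both nodes:

sudo passwd hacluster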

pi@node-a:~ $ pcs client local-auth
Username: hacluster
Password:
localhost: Authorized
pi@node-a:~ $ pcs host auth node-a node-b
Username: hacluster
Password:
node-a: Authorized
node-b: Authorized

The node addresses can also be given explicitly:

pcs host auth node-a addr=192.168.64.51 node-b addr=192.168.64.52

Now create the cluster:

sudo pcs cluster setup haberry node-a addr=192.168.64.51 node-b addr=192.168.64.52 --force

Warning: node-a: Running cluster services: 'corosync', 'pacemaker', the host seems to be in a cluster already
Warning: node-a: Cluster configuration files found, the host seems to be in a cluster already
Warning: node-b: Running cluster services: 'corosync', 'pacemaker', the host seems to be in a cluster already
Warning: node-b: Cluster configuration files found, the host seems to be in a cluster already
Destroying cluster on hosts: 'node-a', 'node-b'...
node-b: Successfully destroyed cluster
node-a: Successfully destroyed cluster
Requesting remove 'pcsd settings' from 'node-a', 'node-b'
node-a: successful removal of the file 'pcsd settings'
node-b: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to 'node-a', 'node-b'
node-a: successful distribution of the file 'corosync authkey'
node-a: successful distribution of the file 'pacemaker authkey'
node-b: successful distribution of the file 'corosync authkey'
node-b: successful distribution of the file 'pacemaker authkey'
Synchronizing pcsd SSL certificates on nodes 'node-a', 'node-b'...
node-a: Success
node-b: Success
Sending 'corosync.conf' to 'node-a', 'node-b'
node-a: successful distribution of the file 'corosync.conf'
node-b: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.

Now try to start the cluster

pi@node-a:~ $ sudo pcs cluster start --all
node-a: Starting Cluster…
node-b: Starting Cluster…

pi@node-a:~ $ sudo pcs status cluster
Cluster Status:
Stack: corosync
Current DC: node-b (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Mon Mar 1 22:25:52 2021
Last change: Mon Mar 1 22:25:43 2021 by hacluster via crmd on node-b
2 nodes configured
0 resources configured

PCSD Status:
node-a: Online
node-b: Online

pi@node-a:~ $ sudo pcs status nodes
Pacemaker Nodes:
Online: node-a node-b
Standby:
Maintenance:
Offline:
Pacemaker Remote Nodes:
Online:
Standby:
Maintenance:
Offline:

pi@node-a:~ $ sudo pcs status corosync

Membership information

Nodeid      Votes Name
     1          1 node-a (local)
     2          1 node-b

pi@node-a:~ $ sudo pcs status
Cluster name: haberry

WARNINGS:
No stonith devices and stonith-enabled is not false

Stack: corosync
Current DC: node-b (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Mon Mar 1 22:28:43 2021
Last change: Mon Mar 1 22:25:43 2021 by hacluster via crmd on node-b

2 nodes configured
0 resources configured

Online: [ node-a node-b ]

No resources

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

Since no STONITH device exists on this cluster (for now), you need to disable STONITH and also tell the cluster to ignore the loss of quorum:

sudo pcs property set stonith-enabled=false
sudo pcs property set no-quorum-policy=ignore

pi@node-a:~ $ sudo pcs property
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: haberry
dc-version: 2.0.1-9e909a5bdd
have-watchdog: false
no-quorum-policy: ignore
stonith-enabled: false

Now you need to create a cluster resource for the virtual IP address:

sudo pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.168.64.50 cidr_netmask=32 op monitor interval=5s

pi@node-a:~ $ sudo pcs status resources
 virtual_ip     (ocf::heartbeat:IPaddr2):       Started node-a

You can use the following commands to manually move the resource from one node to the other and verify that the VIP ownership changes:

sudo pcs node standby node-a
sudo pcs node unstandby node-a

sudo pcs node standby node-b
sudo pcs node unstandby node-b

If you want to manually force a failover, you can use this command to stop a node

sudo pcs cluster stop node-a

Now we need to create a resource for the Home Assistant service so that it becomes managed by the cluster.
The Home Assistant service must remain disabled in systemd so that it is the cluster that decides when to start and stop it.
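For example, assuming the standard Home Assistant Core unit name from a manual venv install (adjust it to match your own unit):

sudo systemctl disable home-assistant@homeassistant.service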

pcs resource create homeassistant systemd:home-assistant@homeassistant.service
pcs resource create clustersync systemd:clustersync.service --group ha_group

To create a group for your resources:

sudo pcs resource group add ha_group virtual_ip homeassistant

This guide is missing an additional resource (service) that I use to synchronize the HA profile across the nodes. I will cover that in another guide.


Any update to the guide? I’d really like to replicate this project. Also, what is the second Ethernet adapter you are using?

Apologies for the late reply. The Ethernet adapter is a shield based on the ENC28J60; you can get it from Amazon. The Raspberry Pi sees it natively by just enabling the related overlay in /boot/config.txt:
dtoverlay=enc28j60

I’m still progressing with this project and developed a dedicated hardware board with support for ZigBee coordinator failover.

Stay tuned!


VERY NICE! Following!

Question: are you planning on engineering some type of handoff for USB, for example Z-Wave using a USB Z-Wave interface?

@fdlou147 unfortunately I have no plans to support any USB device. My board only has support for ZigBee at this time.


That’s fine, but it’s still an amazing write-up and I am going to follow. I’m going to use your idea and hope to get a second NIC teamed on my installation.


Hi,
Would your setup allow redundancy of the full system, so that if one Pi fails the other takes over?
That would include any app running on the Pis, not only HA.
I’m looking into how to make my network monitoring Pi redundant.
Thanks,
Stefan

As long as your app runs as a Linux service, it can be managed by Pacemaker and made redundant on two Pis, so that when one goes down it fails over to the other.
If you have two Pis you can try.
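
As a rough sketch (the service and group names here are just placeholders), registering one of your own systemd services as a cluster resource next to a virtual IP looks like this:

sudo pcs resource create myapp systemd:myapp.service
sudo pcs resource group add app_group virtual_ip myapp

The group keeps the service and the virtual IP together on whichever node is currently active.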

I do have two Pi 4s.
Each Pi is running 64-bit.
Apps are mostly deployed as Docker containers.

Is there a tutorial showing how to set up redundancy?
Or could you share some bullet points on what you did?


At the top of this thread there is a sort of guide I made some time ago; you can refer to that and try to adapt it to your application.
Unfortunately I have no experience with Docker.


Thanks, that is helping me, so it is only Pacemaker and a virtual/cluster IP.
It could be an even more interesting concept by integrating shared network storage, so data is always in sync: HA Cluster with Linux Containers based on Heartbeat, Pacemaker, DRBD and LXC - Thomas-Krenn-Wiki plus help for docker compose: Creating Highly Available Nodes on ICON — Stage 2: HA Cluster with Nginx and P-Rep Node | by 2infiniti (Justin Hsiao) | Medium

I have to get some spare Raspberry Pis and a free weekend to test it :slight_smile:

Some resources explaining the same as you did, from another perspective:

Very interested in this, but what is the service you are using to sync the HA instances over the network? I see several references to an additional resource to synchronize the HA profiles, but what is that service? I would very much like to set this up, but I would need to know what service is used to sync the two HA instances, since I have two boxes with dual NICs already. The guide was extremely helpful in regards to setting up the virtual IP that the clients point to, so you aren’t even aware of which HA instance you are currently connected to. I imagine this would work the same with a 3-node cluster, although as you stated that might be overkill.

Hello James, thanks for your interest in my system.
There are a few ways to sync folders across two (or more) nodes; the one I use in my system is probably the simplest, and it’s based on rsync and inotifywait.

You need to install inotifywait

sudo apt install inotify-tools

create a bash script hasync.sh

#!/bin/sh
# Watch the Home Assistant config directory and push any change to the other node.
# The destination host below is node-b; the copy of this script on node B must point to node A instead.
while inotifywait -q -q -r -e modify,create,delete,move /home/homeassistant; do
    rsync -azrhq --log-file=/dev/null --delete --delete-excluded --exclude-from exclude.txt --password-file=rsync_pass /home/homeassistant/ rsync://rsyncclient@node-b/homeassistant
done

and a file rsync_pass with this content (put a stronger password!)

123456
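
rsync refuses to read a password file that other users can access, so restrict its permissions:

chmod 600 rsync_pass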

Configure /etc/rsyncd.conf (example below). Since each node receives the sync from the other, this is needed on both nodes.

lock file = /var/run/rsync.lock
#log file = /var/log/rsyncd.log
log file = /dev/null
pid file = /var/run/rsyncd.pid

[homeassistant]
path = /home/homeassistant
comment = Homeassistant config directory
uid = homeassistant
gid = homeassistant
read only = no
list = yes
auth users = rsyncclient
secrets file = /etc/rsyncd.secrets
hosts allow = 192.168.1.0/255.255.255.0

Create /etc/rsyncd.secrets

rsyncclient:123456

sudo chmod 600 /etc/rsyncd.secrets

Only root should have access to the secrets file.

I also created an rsync exclude.txt file with the folders/files that I don’t want to replicate:

.cache
zigbee.db-journal

Then you need to create a systemd service in order to have the sync process running

hasync.service

[Unit]
Description=Home Assistant profile sync utility
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=root
WorkingDirectory=/home/pi
ExecStart=/bin/bash hasync.sh

[Install]
WantedBy=multi-user.target

You need to create two configs, one on node A and one on node B.
The one on node A must sync to node B.
The one on node B must sync to node A.

Only one sync process must be active at a given time, and only on the currently active node.
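
One way to enforce that, sketched here assuming you keep the unit name hasync.service and the ha_group from the guide above, is to leave hasync.service disabled in systemd and let Pacemaker manage it as another resource in the same group:

sudo pcs resource create hasync systemd:hasync.service --group ha_group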

I understand this is not a detailed guide but it may help you with a starting point.

I finally had time to create a personal blog where I will cover all aspects of running Home Assistant in a cluster,
including the node synchronization and the hardware I designed.


Thanks so much for the detailed information. I’m trying to set up failover for HA and there are a few different ways, but most involve VMware, Proxmox or Docker, which just complicates things, and VMware is extremely expensive, especially on the networking side of things, plus other factors, so the quick feedback and detailed info is extremely appreciated.

Even then, it also sounds like those methods may involve waiting for the failover VM to spin up, and you lose all recorder history. With this method it appears you lose just the recorder/history that’s currently in memory on the box that goes down, which is also a plus for doing it this way IMO. Thanks again!

I also use the strangest password for every site, Passw0rd. That 0 really throws people off and could never be brute forced :slight_smile:

Hello everyone,

I just wanted to tag on to this thread and mention I’ve recently gone down a similar path just as @Lantastic has. I happen to be employed by LINBIT and we provide enterprise support for Pacemaker and Corosync.

My approach differs a bit due to using DRBD (obviously!) to synchronously replicate the filesystem used for storing Home Assistant’s data in real time between the nodes. Using this solution for mirroring data ensures each node has an exact block-for-block copy of Home Assistant’s filesystem and data. You’ll never lose data that has been written to disk this way during failovers.
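
For anyone curious, a minimal DRBD resource definition for a two-node setup like the one in this thread could look like the sketch below; the backing partition is just a placeholder, and the node names and IPs are the ones from the guide above:

# /etc/drbd.d/homeassistant.res
resource homeassistant {
    device    /dev/drbd0;
    disk      /dev/sda3;        # placeholder backing partition
    meta-disk internal;
    on node-a {
        address 192.168.64.51:7789;
    }
    on node-b {
        address 192.168.64.52:7789;
    }
}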

I do have some fencing agents for Shelly and Tasmota smart plugs in the works: a very “Home Assistant way” to bring fencing (also called STONITH) to Home Assistant high availability.
