Home Assistant running on a high-availability, fault-tolerant mini cluster made of two Raspberry Pis (Pacemaker, Corosync, PCS, NIC bonding)

Hi Everyone!

I’m new to Home Assistant and I’m in the process of migrating my custom home-automation application to Home Assistant.

I have made a mini-cluster with two Raspberry Pi and I’m running Home Assistant on it. My goal is to have a redundant, fault-tolerant system with no single point of failure.
This system should be able to survive a failure of any single component, including network equipment, cabling, etc.
At the moment I’m focusing on the software side, but I’m also developing a hardware interface to connect wired sensors to both nodes.

The cluster uses the well-known Pacemaker-Corosync-PCS stack plus an additional service to sync the Home Assistant profile across the two nodes.

I have made a video to show the network redundancy and failover of Home Assistant between the two nodes.

The cluster has been running for more than two months with no issues. I upgraded Home Assistant a couple of times on each node without interrupting the service.

Your thoughts or comments are welcome!


Cool! :sunglasses:

Is this also a little YouTube clickbait? :yum:

Veeeery nice !
A little bit over the top, I like it very much. :slight_smile:

Am I understanding correctly that there is always one instance of Home Assistant running, and that when it fails your setup will start HA on the other node?

How do you handle split-brain scenarios with only two nodes?

I expected this observation, and it is correct.
Yes, there is a risk of a split-brain scenario, even though I believe it is mitigated by the redundant network link.
However, as I said in my previous post, I’m developing a hardware board that will also act as a STONITH device and will be capable of hard-resetting the nodes. This should ensure that only one instance of HA is running at any given time.
As we know, a real cluster should have three or more nodes (an odd number), but I believe that is overkill for this type of application.

What is your thought on state synchronisation between nodes?

Anything that is written to disk can be kept in sync between the two nodes. If the state is something that is only living in memory, then it is lost during a failover.
As I said, I’m new to Home Assistant and I still have a lot to learn.
I will do some testing on this.
Thanks for pointing that out.

Would something like VMware Fault Tolerance be available in your solution (the way it works)?

Can you please be more specific ?

I meant having a “synchronous copy” of the entire Home Assistant including add-ons. E.g. I run Home Assistant as a VM on ESXi and one on Proxmox. VMware supports real-time replication of an entire machine from one host to another, including every CPU cycle. I know, this is very specific and also :money_mouth_face::money_mouth_face::money_mouth_face:.

https://kb.vmware.com/s/article/1013428

I wondered if your setup works “alike”

I believe that is something only VMware or similar virtualization software can do.

I’m actually even more curious to see your redundant hardware board than this! :smiley:

I never even thought it was possible to run a second NIC on a RPi.

I’m also frankly surprised it has been working for several months already. I would have thought it wasn’t possible to run Home Assistant in high availability (because you’d need state synchronization).
It must not have been easy to make it work, kudos!

My old custom-built home software persisted every change of the finite state machine to the database so that it could be resumed after a sudden crash.
I’m not sure whether similar behaviour can be obtained with Home Assistant.
BTW, at this time I’m not worried about achieving complete redundancy including state sync, because I believe that having a second instance of the application ready to start is better than a complete crash that would require human intervention and probably some time to restore.
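The persistence idea above can be sketched in a few lines. This is a toy example, not Home Assistant code: every transition is written through to SQLite before it takes effect, so a process starting after a failover can resume from the last durable state (all names here are illustrative).

```python
import sqlite3

class PersistentStateMachine:
    """Toy FSM that persists every transition to SQLite so the
    state survives a crash or a failover to the other node."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS fsm_state "
            "(id INTEGER PRIMARY KEY CHECK (id = 1), state TEXT)"
        )
        self.conn.commit()

    @property
    def state(self):
        row = self.conn.execute(
            "SELECT state FROM fsm_state WHERE id = 1"
        ).fetchone()
        return row[0] if row else "idle"

    def transition(self, new_state):
        # Write-through: the transition is durable before we act on it
        self.conn.execute(
            "INSERT INTO fsm_state (id, state) VALUES (1, ?) "
            "ON CONFLICT(id) DO UPDATE SET state = excluded.state",
            (new_state,),
        )
        self.conn.commit()

# Usage: a fresh process (e.g. after failover) reconnects and resumes
fsm = PersistentStateMachine("/tmp/fsm.db")
fsm.transition("heating_on")
resumed = PersistentStateMachine("/tmp/fsm.db")
print(resumed.state)
```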

Below is a simple schematic of my idea for making hardwired sensors redundant.
I will only support I2C and UART, as all of my sensors use those types of links.

Is there going to be a how-to guide for lesser mortals?


Sure!
I can write a quick guide with all the steps I took to implement my system.
Please take into account that this is far from a production-quality setup.
There is still a lot of work to do, but I’m glad someone is interested.


Inspiring setup you have there! Love the effort you have gone to so that it is properly HA.

I don’t use Pi myself, I have 2x small DIY tower PCs with a bit more grunt as I run other things too (one as a master, one as a backup - periodically synced to).

I have been planning to configure UCARP on them so that I could quickly flip the IP over should the master die, but never gave any thought to stuff like Corosync!

I believe you could build a similar redundant system with your PCs as long as they run Linux.
If you only need a virtual IP, you might want to have a look at Keepalived.
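For reference, a minimal Keepalived VRRP configuration for the virtual-IP-only case looks roughly like this (illustrative sketch; the interface name and addresses are examples, not anyone’s actual setup):

```
# /etc/keepalived/keepalived.conf -- illustrative sketch
vrrp_instance VI_1 {
    state MASTER            # BACKUP on the second node
    interface eth0
    virtual_router_id 51
    priority 100            # use a lower value (e.g. 90) on the backup
    advert_int 1
    virtual_ipaddress {
        192.168.64.50/24
    }
}
```

The node with the higher priority holds the VIP; when it stops advertising, the backup takes the address over within a few seconds.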


As promised, here is a guide on how to set up Pacemaker on a couple of Raspberry Pis and run HA in a cluster.
Some prerequisites and assumptions:

English is not my first language.

This is a rough guide based on my notes.
I made many attempts, and my progress was based on trial and error, so please take this into account.
I will soon do a clean installation from scratch, which will be a good chance to write a precise guide (even for myself).

At the time of writing, the Python version on the latest Raspberry Pi OS (Buster) needs to be upgraded in order to run HA core.
How to do this is out of the scope of this guide. I compiled Python 3.9.2 from source.

All installations have to be done on both nodes.

You need to define three static IP addresses: one for the VIP (virtual IP) and one for each node.
There must also be DNS resolution or a similar host-name-to-IP resolution.
Since I use OpenWrt on my routers, I defined static host names.

On my installation I defined the following:

ha-berry.lan 192.168.64.50 (virtual IP)
node-a.lan 192.168.64.51
node-b.lan 192.168.64.52
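If you don’t run local DNS like the OpenWrt static host names above, entries along these lines in /etc/hosts on both nodes should provide the required name resolution (addresses taken from the list above):

```
# /etc/hosts on both nodes -- example only
192.168.64.50   ha-berry.lan  ha-berry
192.168.64.51   node-a.lan    node-a
192.168.64.52   node-b.lan    node-b
```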

Install the Pacemaker stack

sudo apt-get update
sudo apt-get upgrade

Reboot the Raspberry Pi, then:

sudo apt-get install pacemaker

(this will also install corosync)

Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  cluster-glue corosync docutils-common fence-agents gawk libcfg7 libcib27 libcmap4
  libcorosync-common4 libcpg4 libcrmcluster29 libcrmcommon34 libcrmservice28 libcurl3-gnutls
  libdbus-glib-1-2 libimagequant0 libjbig0 libknet1 liblcms2-2 liblrm2 liblrmd28 libltdl7
  liblzo2-2 libmariadb3 libnet-telnet-perl libnet1 libnspr4 libnss3 libopenhpi3 libopenipmi0
  libpaper-utils libpaper1 libpe-rules26 libpe-status28 libpengine27 libpils2 libplumb2
  libplumbgpl2 libqb0 libquorum5 libsensors-config libsensors5 libsgutils2-2 libsigsegv2
  libsnmp-base libsnmp30 libstatgrab10 libstonith1 libstonithd26 libtiff5 libtimedate-perl
  libtransitioner25 libvotequorum8 libwebp6 libwebpdemux2 libwebpmux3 libxml2-utils libxslt1.1
  mariadb-common mysql-common openhpid pacemaker-cli-utils pacemaker-common
  pacemaker-resource-agents python3-asn1crypto python3-boto3 python3-botocore
  python3-cffi-backend python3-cryptography python3-dateutil python3-docutils python3-fasteners
  python3-googleapi python3-httplib2 python3-jmespath python3-monotonic python3-oauth2client
  python3-olefile python3-openssl python3-pexpect python3-pil python3-ptyprocess python3-pyasn1
  python3-pyasn1-modules python3-pycurl python3-pygments python3-roman python3-rsa
  python3-s3transfer python3-sqlalchemy python3-sqlalchemy-ext python3-suds python3-uritemplate
  resource-agents sg3-utils sgml-base snmp xml-core xsltproc
Suggested packages:
  ipmitool python3-adal python3-azure python3-keystoneauth1 python3-keystoneclient
  python3-novaclient gawk-doc liblcms2-utils lm-sensors snmp-mibs-downloader crmsh | pcs
  python-cryptography-doc python3-cryptography-vectors docutils-doc fonts-linuxlibertine
  | ttf-linux-libertine texlive-lang-french texlive-latex-base texlive-latex-recommended
  python-openssl-doc python3-openssl-dbg python-pexpect-doc python-pil-doc python3-pil-dbg
  libcurl4-gnutls-dev python-pycurl-doc python3-pycurl-dbg python-pygments-doc
  ttf-bitstream-vera python-sqlalchemy-doc python3-psycopg2 python3-mysqldb python3-fdb
  sgml-base-doc debhelper
The following NEW packages will be installed:
  cluster-glue corosync docutils-common fence-agents gawk libcfg7 libcib27 libcmap4
  libcorosync-common4 libcpg4 libcrmcluster29 libcrmcommon34 libcrmservice28 libcurl3-gnutls
  libdbus-glib-1-2 libimagequant0 libjbig0 libknet1 liblcms2-2 liblrm2 liblrmd28 libltdl7
  liblzo2-2 libmariadb3 libnet-telnet-perl libnet1 libnspr4 libnss3 libopenhpi3 libopenipmi0
  libpaper-utils libpaper1 libpe-rules26 libpe-status28 libpengine27 libpils2 libplumb2
  libplumbgpl2 libqb0 libquorum5 libsensors-config libsensors5 libsgutils2-2 libsigsegv2
  libsnmp-base libsnmp30 libstatgrab10 libstonith1 libstonithd26 libtiff5 libtimedate-perl
  libtransitioner25 libvotequorum8 libwebp6 libwebpdemux2 libwebpmux3 libxml2-utils libxslt1.1
  mariadb-common mysql-common openhpid pacemaker pacemaker-cli-utils pacemaker-common
  pacemaker-resource-agents python3-asn1crypto python3-boto3 python3-botocore
  python3-cffi-backend python3-cryptography python3-dateutil python3-docutils python3-fasteners
  python3-googleapi python3-httplib2 python3-jmespath python3-monotonic python3-oauth2client
  python3-olefile python3-openssl python3-pexpect python3-pil python3-ptyprocess python3-pyasn1
  python3-pyasn1-modules python3-pycurl python3-pygments python3-roman python3-rsa
  python3-s3transfer python3-sqlalchemy python3-sqlalchemy-ext python3-suds python3-uritemplate
  resource-agents sg3-utils sgml-base snmp xml-core xsltproc
0 upgraded, 100 newly installed, 0 to remove and 0 not upgraded.
Need to get 21.5 MB of archives.

sudo apt-get install pcs

The user pi needs to be a member of the haclient group on both nodes:

sudo usermod -a -G haclient pi

pi@node-a:~ $ pcs client local-auth
Username: hacluster
Password:
localhost: Authorized
pi@node-a:~ $ pcs host auth node-a node-b
Username: hacluster
Password:
node-a: Authorized
node-b: Authorized

pcs host auth node-a addr=192.168.64.51 node-b addr=192.168.64.52

sudo pcs cluster setup haberry node-a addr=192.168.64.51 node-b addr=192.168.64.52 --force

Warning: node-a: Running cluster services: 'corosync', 'pacemaker', the host seems to be in a cluster already
Warning: node-a: Cluster configuration files found, the host seems to be in a cluster already
Warning: node-b: Running cluster services: 'corosync', 'pacemaker', the host seems to be in a cluster already
Warning: node-b: Cluster configuration files found, the host seems to be in a cluster already
Destroying cluster on hosts: 'node-a', 'node-b'...
node-b: Successfully destroyed cluster
node-a: Successfully destroyed cluster
Requesting remove 'pcsd settings' from 'node-a', 'node-b'
node-a: successful removal of the file 'pcsd settings'
node-b: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to 'node-a', 'node-b'
node-a: successful distribution of the file 'corosync authkey'
node-a: successful distribution of the file 'pacemaker authkey'
node-b: successful distribution of the file 'corosync authkey'
node-b: successful distribution of the file 'pacemaker authkey'
Synchronizing pcsd SSL certificates on nodes 'node-a', 'node-b'...
node-a: Success
node-b: Success
Sending 'corosync.conf' to 'node-a', 'node-b'
node-a: successful distribution of the file 'corosync.conf'
node-b: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.

Now try to start the cluster:

pi@node-a:~ $ sudo pcs cluster start --all
node-a: Starting Cluster…
node-b: Starting Cluster…

pi@node-a:~ $ sudo pcs status cluster
Cluster Status:
Stack: corosync
Current DC: node-b (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Mon Mar 1 22:25:52 2021
Last change: Mon Mar 1 22:25:43 2021 by hacluster via crmd on node-b
2 nodes configured
0 resources configured

PCSD Status:
node-a: Online
node-b: Online

pi@node-a:~ $ sudo pcs status nodes
Pacemaker Nodes:
Online: node-a node-b
Standby:
Maintenance:
Offline:
Pacemaker Remote Nodes:
Online:
Standby:
Maintenance:
Offline:

pi@node-a:~ $ sudo pcs status corosync

Membership information

Nodeid      Votes Name
     1          1 node-a (local)
     2          1 node-b

pi@node-a:~ $ sudo pcs status
Cluster name: haberry

WARNINGS:
No stonith devices and stonith-enabled is not false

Stack: corosync
Current DC: node-b (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Mon Mar 1 22:28:43 2021
Last change: Mon Mar 1 22:25:43 2021 by hacluster via crmd on node-b

2 nodes configured
0 resources configured

Online: [ node-a node-b ]

No resources

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

Since no STONITH device exists in this cluster (for now), you need to disable STONITH and also relax the quorum policy to silence the warning:

sudo pcs property set stonith-enabled=false
sudo pcs property set no-quorum-policy=ignore

pi@node-a:~ $ sudo pcs property
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: haberry
dc-version: 2.0.1-9e909a5bdd
have-watchdog: false
no-quorum-policy: ignore
stonith-enabled: false

Now you need to create a cluster resource for the virtual IP address:

sudo pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.168.64.50 cidr_netmask=32 op monitor interval=5s

pi@node-a:~ $ sudo pcs status resources
 virtual_ip     (ocf::heartbeat:IPaddr2):       Started node-a

You can use the following commands to manually move the resource from one node to the other and verify that the VIP ownership changes:

sudo pcs node standby node-a
sudo pcs node unstandby node-a

sudo pcs node standby node-b
sudo pcs node unstandby node-b

If you want to manually force a failover, you can use this command to stop a node:

sudo pcs cluster stop node-a

Now we need to create resources for the HA service and the sync service so that they become managed by the cluster.
The HA service must remain disabled in systemd so that it is the cluster that decides when to start/stop it:

pcs resource create homeassistant systemd:[email protected]
pcs resource create clustersync systemd:clustersync.service --group ha_group
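To make sure systemd itself never starts these services, disable (but don’t remove) the units on both nodes. The Home Assistant unit name below is a placeholder; substitute the unit names you used in the `pcs resource create` commands above:

```shell
# Leave unit files in place but remove them from systemd's boot sequence,
# so only Pacemaker starts and stops them.
sudo systemctl disable home-assistant.service   # placeholder: use your actual HA unit name
sudo systemctl disable clustersync.service
```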

To group your resources:

sudo pcs resource group add ha_group virtual_ip homeassistant

This guide is missing an additional resource (a service) that I use to synchronize the HA profile across the nodes. I will cover that in another guide.
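Purely as an illustration of what such a sync service might look like (the unit name matches the `clustersync.service` registered above, but the script, paths, interval, and peer hostname are all assumptions, not the author’s actual implementation):

```
# /etc/systemd/system/clustersync.service -- illustrative sketch only
# [Unit]
# Description=Keep the Home Assistant profile in sync with the peer node
# [Service]
# ExecStart=/usr/local/bin/clustersync.sh
# Restart=always

#!/bin/sh
# /usr/local/bin/clustersync.sh -- push the profile to the standby node
# every 60 seconds; requires passwordless SSH between the nodes.
while true; do
    rsync -a --delete /home/pi/.homeassistant/ node-b:/home/pi/.homeassistant/
    sleep 60
done
```

Because Pacemaker manages the resource, the sync only runs on the active node, which conveniently makes the replication one-directional.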


Any update to the guide? I’d really like to replicate this project. Also, what is the second Ethernet adapter you are using?

Apologies for the late reply. The Ethernet adapter is a shield based on the ENC28J60; you can get it from Amazon. The Raspberry Pi sees it natively once you enable the related overlay in /boot/config.txt:
dtoverlay=enc28j60
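The thread title mentions NIC bonding; with the on-board NIC plus the ENC28J60 available, an active-backup bond could be configured along these lines (a sketch only: interface names, addresses, and the ifupdown syntax are assumptions, and the ifenslave package must be installed):

```
# /etc/network/interfaces.d/bond0 -- illustrative, not the author's config
# requires: sudo apt-get install ifenslave
auto bond0
iface bond0 inet static
    address 192.168.64.51/24
    gateway 192.168.64.1
    bond-slaves eth0 eth1      # on-board NIC + ENC28J60
    bond-mode active-backup    # one link carries traffic, the other stands by
    bond-miimon 100            # check link state every 100 ms
    bond-primary eth0
```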

I’m still progressing with this project and developed a dedicated hardware board with support for ZigBee coordinator failover.

Stay tuned!


VERY NICE! Following!

Question: are you planning on engineering some type of handoff for USB, for example Z-Wave using a USB Z-Wave interface?