High Availability in Home Assistant

Is there any way or guide available for setting up high availability for Home Assistant?

1 Like

The nearest thing to high availability I have found for customers so far is a daily backup stored on a separate machine, plus a similar hardware configuration (NUC) kept ready with the HA image flashed on it! If the NUC dies, disconnect it from network and power, plug the new one in its place, wait for the initial HA setup, and then upload the backup :slight_smile:
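A minimal sketch of the daily-backup half of that routine, assuming the HA OS ha CLI and SSH access (the hostname and paths are examples, not part of the original post):

# create a backup with the HA OS CLI, then copy it to the standby machine
ha backups new --name "daily-$(date +%F)"
# backups land in /backup on HA OS
scp /backup/*.tar backup-host:/srv/ha-backups/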

1 Like

I used to run two Proxmox nodes, using shared storage on a NAS drive. It was possible to automate startup of the HA VM on the second node when the first failed. I don’t think it was true HA, but close. The only problem was that I’d then have to physically move the USB stick for Zigbee.

Anyhow, energy prices being what they are, I only have one server powered up now (and I ditched Proxmox, cos I didn’t like the GUI).

1 Like

You can get part of the way there by moving all your devices over to MQTT, because an MQTT broker allows multiple clients to subscribe to it, so multiple HA installations can use the same MQTT broker, stay up to date, and send commands that way.
This setup requires hubs/gateways, so you need a Zigbee-to-MQTT hub, an RF-to-MQTT hub, a WiFi-to-MQTT hub, and so on.
The problem is that the hardware hubs are still a single point of failure, because the technology and protocols often do not allow multiple hubs.
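As a rough illustration of the shared-broker idea (a hypothetical sketch; hostnames and paths are examples, and the two HA instances are assumed to run on different machines):

# one shared broker that every hub and HA instance points at
docker run -d --name mqtt -p 1883:1883 eclipse-mosquitto:2.0
# two independent HA instances, each with its own config directory,
# both configured to use the same broker at mqtt://<broker-host>:1883
docker run -d --name ha-a --network=host -v /srv/ha-a:/config ghcr.io/home-assistant/home-assistant:stable
docker run -d --name ha-b --network=host -v /srv/ha-b:/config ghcr.io/home-assistant/home-assistant:stable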

2 Likes

I am using Proxmox…
With HA OS in a virtual machine and the database on a separate MariaDB server, it would be easy to clone the existing instance, and with backups running in Proxmox there is also another level of “high availability”.

Sure… if the server dies, that’s another topic, but I believe in that case I have other concerns.

One important rule in automation:
Don’t rely on a single point of failure.
Design your system in a way that, IF HA or your hardware dies, you are still able to do the important things manually.

  • Don’t replace your light switches with something you can’t switch on physically…
  • Don’t implement things that would lock you out of your house if the server stops responding…

Then you don’t really need high availability :wink:

2 Likes

My approach for HA requires manual restoration…

  • HAOS VM on TrueNAS SCALE
  • daily backups of the HAOS system
  • periodic snapshots of the VM instance

1 Like

That is not really high availability then, since you have no redundancy, only backups.

2 Likes

You’re right, but you can achieve HA if you have two VMs on the same machine or, better, on separate machines, and use a router for a bonding/multi-chassis link aggregation solution.

Even a low-cost MikroTik switch/router can support this solution:
https://help.mikrotik.com/docs/display/ROS/High+Availability+Solutions

There is an older project, “HAHA” (Home Assistant High Availability), but it seems to be no longer maintained.
It is based on Docker Swarm.

I do have a guide for configuring a high availability solution for Home Assistant.

I’ve posted the following snippet elsewhere in another thread, but I’ll include it here as well:


Like many of you out there, I have recently found myself more and more reliant on Home Assistant. After looking into Home Assistant high availability I found there aren’t really any options supported by Home Assistant.

So I went ahead and built a solution using existing open source high availability components such as DRBD and Pacemaker. Home Assistant can be made highly available fairly easily using existing tools that have been around for well over a decade. These same tools are used in countless enterprise-grade production deployments. Full disclosure: I am also employed by LINBIT, but I’m not here to sell you anything. These tools are just as free to use as Home Assistant and are part of the open source ecosystem. The required software components can easily be installed through our PPA.
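For reference, the PPA setup is presumably along these lines (the package list is an assumption based on the components named above; check the blog post for the exact steps):

# add LINBIT's PPA and install the DRBD and cluster components
sudo add-apt-repository ppa:linbit/linbit-drbd9-stack
sudo apt update
sudo apt install drbd-dkms drbd-utils pacemaker corosync crmsh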

I am personally a big fan of running Home Assistant as a Docker container. My blog post walks you through the process of making Home Assistant highly available using containers. With this solution, Home Assistant’s data is mirrored over the network in real time (by DRBD, think network-based RAID-1) while Pacemaker controls which node is currently active (replicated filesystem mounted, virtual IP active, Docker instance(s) running, etc.). Failover to a secondary node takes mere seconds once the primary goes down.
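To make the moving parts concrete, here is a minimal sketch of the non-container half in crm shell syntax. The resource names, device, mount point, and IP address are hypothetical examples, not taken verbatim from the blog post:

# DRBD resource replicating the Home Assistant data between nodes
primitive p_drbd_ha ocf:linbit:drbd \
    params drbd_resource="home-assistant" \
    op monitor interval="29s" role="Master" \
    op monitor interval="31s" role="Slave"
ms ms_drbd_ha p_drbd_ha \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

# Filesystem and virtual IP that follow the DRBD Primary
primitive p_fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/mnt/home-assistant" fstype="ext4"
primitive p_vip ocf:heartbeat:IPaddr2 \
    params ip="192.168.1.200" cidr_netmask="24"

# mount and IP only where DRBD is promoted, in the right order
colocation co_fs_with_drbd inf: p_fs ms_drbd_ha:Master
colocation co_vip_with_fs inf: p_vip p_fs
order o_drbd_before_fs inf: ms_drbd_ha:promote p_fs:start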

I hope to make this topic into an ongoing series with more and more content.

7 Likes

Hey @ryan-ronnander, thanks for posting the blog. I have already tried a test setup and it works as you described, except for the part about migrating from a virtual machine (VM) using the official KVM (qcow2) disk image provided by the Home Assistant team.

When the Docker container is installed, I do not see the “Restore from backup” option on the first screen for some reason. I also tried to scp everything from the config directory of the source server to /mnt/home-assistant on the new primary server from your example, but there were issues (I guess SQL was not happy with that approach).
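Roughly what I ran, for reference (the hostname is an example):

# copy the old VM's config directory onto the replicated mount
scp -r root@old-ha-vm:/config/* /mnt/home-assistant/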

  1. Any clues on how to migrate to the highly available Docker setup?
  2. Any ideas on how to use DRBD with the non-Docker setup that I currently have and just add another VM?

Thank you in advance!
Dominicus


20240531 2148 EDIT:

After some further research:

  • There is no web GUI method implemented to restore from a backup for the Home Assistant Docker version.
  • The scp approach seemed to have worked well (except that I had to add a new user via the CLI hass command; see the sketch below)
  • The Docker version of Home Assistant does not support add-ons; one way of using e.g. the Zigbee2MQTT add-on is to run another Docker container (e.g. Living without add-ons on Home Assistant Container)
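A minimal sketch of that user-creation step, assuming the container name from the blog post (the credentials are placeholders):

# create a new user with Home Assistant's built-in auth script
docker exec -it homeassistant hass --script auth --config /config add someuser somepassword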

Would it make sense to extend the following code from your blog post, Home Assistant High Availability, so that instead of one highly available Docker container there are three? How common is it to use multiple containers in such high availability configurations?

# Home Assistant resource
primitive p_docker_home-assistant ocf:heartbeat:docker \
    params image="ghcr.io/home-assistant/home-assistant:stable" \
        allow_pull=true \
        name=homeassistant \
        run_opts="--privileged -e TZ=America/Los_Angeles \
        -v /mnt/home-assistant:/config \
        -v /run/dbus:/run/dbus:ro --network=host" \
    op start interval="0s" timeout="120s" \
    op stop interval="0s" timeout="120s" \
    op monitor interval="20s" timeout="100s"

# MQTT resource (the container command belongs in run_cmd rather than
# run_opts, and with --network=host the -p port mappings are redundant)
primitive p_docker_mqtt ocf:heartbeat:docker \
    params image="eclipse-mosquitto:2.0" \
        allow_pull=true \
        name=mqtt \
        run_opts="-v /mnt/mosquitto-data:/mosquitto \
        --network=host" \
        run_cmd="mosquitto -c /mosquitto-no-auth.conf" \
    op start interval="0s" timeout="120s" \
    op stop interval="0s" timeout="120s" \
    op monitor interval="20s" timeout="100s"

# Zigbee2MQTT resource with serial over network (the docker resource agent
# has no "environment" parameter, so the variable is passed with -e in
# run_opts; Zigbee2MQTT documents overrides in the ZIGBEE2MQTT_CONFIG_* form)
primitive p_docker_zigbee2mqtt ocf:heartbeat:docker \
    params image="koenkk/zigbee2mqtt" \
        allow_pull=true \
        name=zigbee2mqtt \
        run_opts="-e TZ=Europe/Berlin \
        -e ZIGBEE2MQTT_CONFIG_SERIAL_PORT=tcp://192.168.1.12:6638 \
        -v /mnt/zigbee2mqtt-data:/app/data \
        -v /run/udev:/run/udev:ro \
        --network=host" \
    op start interval="0s" timeout="120s" \
    op stop interval="0s" timeout="120s" \
    op monitor interval="20s" timeout="100s"

Anytime. I have personally never tried anything other than the Docker installation of Home Assistant, but yeah, I imagine a simple copy of the HA config directory can handle most of the migration, or a “restore from backup”.

You’re on the right track above with extra containers. In my personal HA installation, I have three containers:

  • Home Assistant
  • MQTT
  • ESPHome

You can modify the Pacemaker configuration from the blog post to run additional containers; grouping all the containers together in a Pacemaker group would be the way to go. There is also a resource agent for Docker Compose, which could simplify the Pacemaker configuration.
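As a rough sketch, using the three container primitives from the configuration above (the group name is an assumption), the grouping could look like this:

# containers start in listed order and stop in reverse, as one unit
group g_home-assistant p_docker_mqtt p_docker_zigbee2mqtt p_docker_home-assistant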

Making a KVM virtual machine highly available (warning: free guide, but gated behind giving up contact information) is also one way to do it.

The blog post is fairly bare-bones to simply show how high availability is possible, but I do plan to revisit it with more of a “real world” configuration.

2 Likes

Hi.

Really interesting guide.
Maybe this can be extended. You explain redundancy with two servers, but there is a problem in case of a power failure of one server: quorum will be lost and Pacemaker will bring down all the resources. This can be tested by turning off both servers and powering up only one of them.
One way to overcome this (not sure if it is best practice) is to ignore quorum with “sudo crm configure property no-quorum-policy=ignore”.

Thanks. Fencing is the ideal way to go for two-node clusters, but you may have also missed the two_node: 1 setting in the corosync.conf file.
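For context, that setting lives in the quorum section of the file; a minimal excerpt (the rest of the file is omitted):

# /etc/corosync/corosync.conf (excerpt)
quorum {
    provider: corosync_votequorum
    two_node: 1
}

Note that two_node: 1 implicitly enables wait_for_all, which is the startup behavior discussed below.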

Hi.

I did try the two_node: 1 setting; I am not sure why it is not working, but changing the no-quorum policy did help. As I have a Zigbee dongle in my setup, automatic failover would not help anyway, because the deployment would fail due to the missing device.
Now we come to another question: is it possible to have some dependency in Pacemaker that checks whether the device is connected, or some script that lists the /dev/tty* devices and, in case the device is not there, prevents bringing up HA and waits until the Zigbee dongle is connected?

Regards

Hmm, maybe you’re hitting the startup behavior where both nodes need to be active at startup, but failover should still function in a two-node cluster. We wrote a KB article on this behavior here: Startup Behavior of a Two-node Pacemaker Cluster

Is it possible to give each node its own Zigbee dongle? I do not personally have experience with Zigbee; everything is WiFi in my deployment. But if a node can never be promoted, you’re probably just better off not using real-time replication, since you can’t fail over in real time anyway.

You might be able to get there with some Pacemaker rules. There is also a generic Delay resource agent.
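A rough sketch of the rules idea (entirely hypothetical: the attribute name, the group name from the sketch above, and whatever udev rule or script maintains the attribute are all assumptions):

# keep the group off any node whose "zigbee" node attribute is not 1;
# a udev rule or small script would maintain the attribute, e.g. with:
#   crm_attribute --node <nodename> --name zigbee --update 1
location l_require_zigbee g_home-assistant \
    rule -inf: zigbee ne 1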

I had the same issue with the dongle and HA. Instead, start the Zigbee dongle in a standalone Docker container, then add it to HA over a TCP socket. If you were to have two dongles, things would get screwed up due to each interfering with ownership of the network.
I use the SkyConnect, so I loaded up the Docker image for the add-on on a Raspberry Pi, and then when adding the Zigbee add-on I set the dongle path to tcp://<host>:<port>

Look here: GitHub - b2un0/silabs-multipan-docker: A standalone RPC server based on HomeAssistant's Silicon Labs multiprotocol addon
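If you don’t want to run the linked multipan image, a generic way to expose a dongle over TCP is socat (a sketch only; the device path and port are examples, and there is no reconnect handling):

# forward a local serial device to TCP port 6638
socat tcp-listen:6638,reuseaddr,fork file:/dev/ttyUSB0,b115200,raw,echo=0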

@ryan-ronnander

Thanks for the blog post. I am following the steps, but I cannot find the package drbd-dkms and am getting the error below; however, drbd-utils installed fine.

Unable to locate package drbd-dkms

The command sudo modprobe drbd && modinfo drbd returns

pi@ha-01:/dev$ sudo modprobe drbd && modinfo drbd
filename: /lib/modules/6.11.0-1004-raspi/kernel/drivers/block/drbd/drbd.ko.zst
alias: block-major-147-*
license: GPL
version: 8.4.11
description: drbd - Distributed Replicated Block Device v8.4.11
author: Philipp Reisner [email protected], Lars Ellenberg [email protected]
srcversion: F89AA075F204021079D1DFD
depends: lru_cache,libcrc32c
intree: Y
name: drbd
vermagic: 6.11.0-1004-raspi SMP preempt mod_unload modversions aarch64
sig_id: PKCS#7
signer: Build time autogenerated kernel key
sig_key: 73:50:3F:A3:89:[redacted]:CB:F1:5F:18
sig_hashalgo: sha512
signature: 42:15:52:BE:0C:FA:8C:EE:95:0B:B5:FA:33:29:FB:[redacted]
parm: allow_oos:DONT USE! (bool)
parm: disable_sendpage:bool
parm: proc_details:int
parm: minor_count:Approximate number of drbd devices (1U-255U) (uint)
parm: usermode_helper:string

I continued the steps without the dkms package just to see what would happen, and I get the error below when I try to configure the disks

pi@ha-01:/dev$ sudo wipefs -afq /dev/sdb
wipefs: error: /dev/sdb: probing initialization failed: No such file or directory

and this error when I try to get the status from drbdadm

pi@ha-01:/dev$ drbdadm status
drbd.d/home-assistant.res:7: Parse error: 'disk | device | address | meta-disk | flexible-meta-disk' expected,
but got 'node-id'

Another question:

  1. What is the best approach to updating the containers? I was considering using something like Portainer, but I am not sure whether it is supported.