HA in Docker Swarm / high availability cluster

Hi all,
I have searched everywhere for a working setup for Home Assistant in Docker Swarm, and have tried to combine everything I found into a (somewhat) working setup. I have three nodes that form a highly available cluster. HA is connected to MariaDB, and I manage the entire setup with Portainer. The hardest part was the network setup. This worked for me:

  1. create a macvlan_local on every node, with a subnet that is part of the rest of my network. I have split my subnet into two parts:
    a) ip range .0 - .200 managed by DHCP and
    b) ip range .224/27 managed by docker macvlan
  2. create a macvlan_swarm, based on the local macvlan (--config-from macvlan_local)

docker network ls | grep macvlan

NETWORK ID     NAME                                 DRIVER    SCOPE
xxxxxxx        macvlan_local                        null      local
xxxxxxx        macvlan_swarm                        macvlan   swarm
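
For anyone who wants to reproduce steps 1 and 2 from the command line, here is a minimal sketch. It is not necessarily the exact set of commands used for the setup above: the 192.168.1.0/24 subnet, the 192.168.1.1 gateway and the eth0 parent interface are placeholder assumptions, and the function names are made up for illustration. Only the .224/27 range and the two network names come from the post.

```shell
# Sketch only: SUBNET, GATEWAY and PARENT_IF are assumed placeholders,
# adjust them to your own LAN. The /27 range and network names follow the post.
SUBNET="${SUBNET:-192.168.1.0/24}"
GATEWAY="${GATEWAY:-192.168.1.1}"
IP_RANGE="${IP_RANGE:-192.168.1.224/27}"
PARENT_IF="${PARENT_IF:-eth0}"

# Step 1: run on EVERY node. A config-only network just stores the
# node-local settings; it is listed with driver "null" and scope "local".
create_macvlan_local() {
  docker network create --config-only \
    --subnet "$SUBNET" \
    --gateway "$GATEWAY" \
    --ip-range "$IP_RANGE" \
    -o parent="$PARENT_IF" \
    macvlan_local
}

# Step 2: run ONCE on a manager node. The swarm-scoped macvlan network
# takes its per-node settings from each node's macvlan_local config.
create_macvlan_swarm() {
  docker network create -d macvlan --scope swarm \
    --config-from macvlan_local \
    macvlan_swarm
}
```

Source the file, call `create_macvlan_local` on each node and `create_macvlan_swarm` on one manager; `docker network ls | grep macvlan` should then resemble the listing above.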

The rest of the setup:

  • 3 nodes (LibreELEC on a WeTek Hub, Ubuntu 20.04.1 LTS (64-bit) on an RPi 3B, and the same on an RPi 4)
  • Synology NAS with NFS share with docker config
  • docker version 20.10.2
  • deployed/created everything from Portainer

docker-compose

version: '3.7'

services:
  homeassistant:
    # note: docker stack deploy ignores `restart`; swarm reschedules failed tasks itself
    restart: always
    image: homeassistant/home-assistant
    volumes:
      - <path to homeassistant config>:/config
      - /etc/localtime:/etc/localtime:ro
    ports:
      - 8123:8123
    # note: docker stack deploy ignores `depends_on`; the MariaDB service is not defined in this file
    depends_on:
      - mariadb_swarm_mariadb
    deploy:
      replicas: 1
    networks:
      - macvlan_swarm
      
networks:
  macvlan_swarm:
    external: true

Still to do: my Tesla account sometimes gets lost after a failover or node reboot, and the same goes for Nabu Casa. I think both are tied to a token created on the first IP address; after a failover or reboot, HA runs on a different IP address/node.

Still, I am happy with the way it works right now. If you are interested in the detailed setup, let me know. If you have any ideas for improving it, do let me know as well… :)

Screenshot from cluster visualizer in Portainer:

Nice setup! I have noticed the same issue with my Nest integration when I run Home Assistant on a Kubernetes cluster. I’ll keep you posted!

I considered running HA with k3s on the 3 RPi3s I have in my house. I ended up frustrated every time, because:

  • persistent volumes: there is no easy way to establish a shared persistent volume, unless you use a central NFS server (which introduces a single point of failure)
  • HA doesn’t scale. It’s a monolith and you can only run a single instance on your cluster.

@verguldebarman, how do you manage MQTT in the swarm? Is it standalone or in a swarm as well? Which software did you use for swarm or another HA mode? I’m currently trialing both Swarm and K8s and find that there aren’t many vendor guides on how to run an MQTT cluster in Swarm. It’s mostly about K8s.

@m0wlheld, what was your CPU utilisation with K8s on the RPi, and the overall responsiveness? I’ve got mine on an Intel J1800 and it averages 15-20% load at idle without any Home Assistant on top. Also, all the K8s processes seem to occupy about 1G of RAM. On persistent storage: what software did you try? I’m trialing Gluster with a locally mounted volume and it seems fine so far.

I used k3s, a minimized version of K8s along with GlusterFS. K3s does not come with a Gluster storage driver and I failed to use local path.

Sorry guys, I’ve quit using swarm… Back to single nodes. I was testing and changing too much, which made the environment unstable. Perhaps someday, if someone (else) finds a good stable solution…

PS: I didn’t use MQTT in the swarm.

Pity. The more add-ons exist and the more you add things in HACS, the more something like swarm makes sense. I hope someone (waves at Nabu Casa) takes swarm seriously.

I’m running HA-Supervised, and the container count keeps growing.

root@HA161:~# docker ps | wc -l
24

I’m guessing things have progressed since this thread spun up.

I have been running Home Assistant containerized on Docker Swarm for over a year and haven’t had any issues. What may come as an added surprise: I haven’t been running in host mode either, though this is likely to change with the advent of Matter.

My environment:

  • 3 node docker swarm, all running as managers
  • 1 standalone docker host for my Zigbee and Z-Wave dongles
  • 1 Synology NAS mounted via NFS

A snip of my stack-hass.yaml

  homeassistant:
    image: homeassistant/home-assistant:2024.12
    hostname: hass
    environment:
      <<: *default-environment
    volumes:
      - /usr/share/zoneinfo/America:/usr/share/zoneinfo/America:ro
      - /usr/syno/ssl:/usr/syno/ssl:ro
      - hass-nfs:/config
    networks:
      - swarm
    ports:
      - published: 8123 #Home Assistant UI
        target: 8123
        protocol: tcp
    deploy:
      <<: *default-replicated
      replicas: 1

The swarm network is an overlay network with attachable set to true.
The standalone docker host is joined to the swarm, so I guess I kinda have 4 nodes, but its node availability is set to Pause. This allows it to be part of the swarm overlay network while using a non-swarm docker-compose.yaml. Kinda half pregnant?
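
For readers who want to try the same arrangement, a rough sketch of the two commands involved follows. This is an assumption about one way to get there, not a copy of my actual setup steps: the node name `zigbee-host` and the function names are hypothetical, and the overlay network could just as well be declared in the stack file.

```shell
# Sketch only: NODE is a hypothetical node name; "swarm" is the
# network name used in the stack file.
NODE="${NODE:-zigbee-host}"

# Run once on a manager: an attachable overlay lets standalone containers
# (docker run / docker-compose) join the network, not only swarm services.
create_overlay() {
  docker network create -d overlay --attachable swarm
}

# Run on a manager: a paused node stays a swarm member, but the
# scheduler assigns it no new tasks.
pause_node() {
  docker node update --availability pause "$NODE"
}
```

After `pause_node`, `docker node ls` should list the node with availability Pause, and a compose file on that host can reference the overlay as an external network.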

Also in my stack-hass.yaml are the following services:

  • mqtt
  • ring-mqtt

From my ‘real’ network I can hit any of my 3 manager nodes and the ingress routing will put me onto the host currently running Home Assistant. The node it’s running on changes somewhat frequently as I update Ubuntu or Home Assistant, with only a blip of interruption as Home Assistant settles in after automagically spinning up on one of the other nodes.
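
A quick way to see the ingress routing in action is to ask every manager for the UI. This is only an illustrative sketch: the hostnames `swarm1`, `swarm2` and `swarm3` and the function name are hypothetical, while 8123 is the published port from the stack file.

```shell
# Sketch only: the three hostnames are hypothetical placeholders.
MANAGERS="${MANAGERS:-swarm1 swarm2 swarm3}"

# Request the Home Assistant UI from every manager and print the HTTP
# status code next to the node name. With the routing mesh, each node
# should answer, whichever one actually runs the task.
check_ingress() {
  for node in $MANAGERS; do
    code="$(curl -s -o /dev/null -m 5 -w '%{http_code}' "http://$node:8123/")"
    echo "$node $code"
  done
}
```

Calling `check_ingress` prints one line per manager; a node that cannot be reached within 5 seconds shows up as `000`.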

I have added avahi to all three nodes, and with this all the mDNS traffic gets to Home Assistant without having to run in macvlan, host, ipvlan, etc. mode: all the benefits of the ingress service router without having to do anything overly complex. Google Cast, HomeKit and Matter-over-IP devices all autodiscover.

As I’ve started playing with Matter and Thread, I have run into many issues that I suspect are related to how docker swarm does (or doesn’t) handle IPv6. For instance, sometimes I see my OTBR, but the majority of the time it just disappears on me shortly after being discovered. I also cannot provision a Matter device even though I can communicate with the Matter server (it spins at the credentials part forever and ever and ever).

Once I get time to play more with the IPv6 settings and can confirm it won’t work on the overlay network, I’ll move Matter, OTBR and Home Assistant to the host network and run keepalived to track which host is currently running the service, because service routing doesn’t work in host mode.

Hope this makes sense … the real message … it works and it works very well in most situations.
—Rob

I ran docker swarm for a while, and finally moved away from it. Moved to supervised and then… Ironically, I just migrated my entire environment into a HAOS VM. Backed everything up on my supervised install, restored the backup to a HAOS 14.1 install, off and running.

(Debian 11 became unsupported, didn’t want to upgrade in place, risk breaking things)

Truthfully the LEAST painful HA upgrade I’ve done in 3 years. Only thing that broke was the SolArk integration from HACS, and they already published a fix.

Big picture, I think moving HA to a Docker Swarm / Kube / Podman model makes the most sense longer term. I’d much rather run HA across 2, 3 or more tiny nodes than on one big VM, even with all the underlying redundancy at the KVM level.