Background
I have run duplicated Home Assistant servers on RPi 2s, then 3s, for many years with identical configurations, but only running Lovelace with manual operation of devices. I don’t use the Docker implementation of HASSIO - what is the point when hardware is cheap?
I have to admit I am paranoid about server failure (due to my experience of no-fail systems in my working life in the financial services industry!) and I do not want a single point of failure in my home.
Manual operation of remote devices has been totally successful with one server following another.
Duplicated Automations?
I am now - at last - getting some automations to work to complement manual operations. Things like turning the lights off at night. However, these automations are on one server only at present and so constitute a single point of failure. Whilst it would be no problem to copy the automation files from server 1 to server 2, I do not know if there would be timing race problems between the systems. Although both servers obtain their time from my local NTP servers, there could be slight discrepancies in when each HA server executes an automation.
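One way to reason about the race question is to ask whether each automation sets an absolute state or a relative one. The sketch below is a toy simulation only: `Light` and `run_on_both_servers` are made-up names for illustration and are not Home Assistant classes. It assumes both servers send the same command to the same device a moment apart.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Light:
    """Stand-in for a smart device (hypothetical, not a Home Assistant class)."""
    is_on: bool = True
    log: List[str] = field(default_factory=list)

    def turn_off(self, source: str) -> None:
        # Absolute command: a repeat from the second server changes nothing.
        self.is_on = False
        self.log.append(f"{source}: turn_off")

    def toggle(self, source: str) -> None:
        # Relative command: a duplicate undoes the first one.
        self.is_on = not self.is_on
        self.log.append(f"{source}: toggle")


def run_on_both_servers(light: Light, action: str) -> bool:
    """Apply the same action once per server and return the final on/off state."""
    for server in ("ha1", "ha2"):
        getattr(light, action)(server)
    return light.is_on


if __name__ == "__main__":
    print(run_on_both_servers(Light(is_on=True), "turn_off"))  # light ends up off
    print(run_on_both_servers(Light(is_on=True), "toggle"))    # light is back on
```

If that reasoning holds, "turn the lights off at night" should tolerate being sent twice, whereas anything built on toggling, counting or incrementing would probably need care on duplicated servers.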
Could anyone advise or have experience of duplicated HA servers and automations?
MQTT Servers?
At present, I have a single in-house MQTT server which acts as broker for some of my devices. I perceive this as a single point of failure. I am not sure how a duplicated MQTT server might work. I think it might be possible for duplicated systems to use the keepalived daemon to act as Master and Slave with a common virtual IP address and domain name. (I already have other systems working in Master/Slave with keepalived.) I would appreciate some thoughts or experiences on MQTT. (I do not want to use a public broker service.)
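For the keepalived idea, the virtual IP should only stay on a machine whose broker is actually answering. keepalived can run an external check script (a `vrrp_script` block) and treats exit code 0 as healthy. A minimal sketch of such a check follows; the function name, the host argument and the 2 second timeout are assumptions for illustration, and it only tests that the TCP port accepts connections, not that the broker handles MQTT traffic correctly.

```python
import socket
import sys


def broker_reachable(host: str, port: int = 1883, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): check_mqtt.py 127.0.0.1
    # Exit code 0 means healthy, 1 means the broker did not answer.
    sys.exit(0 if broker_reachable(sys.argv[1]) else 1)
```

Port 1883 is the standard unencrypted MQTT port; a broker listening elsewhere would need the port passed in. Note that a failover of this kind moves the address, not the broker's state, so retained messages and sessions would not follow unless the brokers are bridged or share storage.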
Active-Active type of redundancy can be tricky.
Did you consider Active-Passive, i.e. identical HA configuration, but only one HA actually running at any given time?
I want a backup RPi4 that is identical to the running RPi4 with HA… so that when this HA crashes I can simply change the static IP of the backup RPi4 and reboot… move some USB cables and I’m running again… Then rebuild the production one.
Typically, you would put a load-balancer like HAProxy in front of your HA cluster.
That would be the address used by everything else.
Such load-balancers typically will automatically go to your “backup” HA if the “main” one stops responding.
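The failover behaviour described here boils down to "use the first server in the list that passes a health check". A rough sketch of that selection logic, with made-up server names and the health check passed in as a function (HAProxy does this itself with health-checked servers and its `backup` server option, so this is only to illustrate the idea):

```python
from typing import Callable, Optional, Sequence


def pick_backend(backends: Sequence[str],
                 is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first backend that passes the health check, or None if all fail."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    return None


if __name__ == "__main__":
    # Hypothetical names: "ha1" is the main server, "ha2" the backup.
    down = {"ha1"}
    print(pick_backend(["ha1", "ha2"], lambda name: name not in down))
```

The order of the list is what makes one server "main" and the other "backup".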
OFC, as stated, if there are “physical” devices hooked to the main HA (e.g. zigbee / ZWave sticks), they need to be physically moved to the backup one, indeed.
Replication of the config/sqlite db could be implemented with a simple rsync or equivalent.
There might be addons around doing that, but if not, would be semi-trivial to create.
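As a rough idea of what such a replication job could look like, here is a sketch that builds an rsync command line and, optionally, runs it. The paths, the target host and the default exclude pattern are assumptions, not anything HA prescribes. One caveat: copying a SQLite file while it is being written can give an inconsistent copy, so the database might be better copied with HA stopped, or excluded.

```python
import subprocess
from typing import List, Sequence


def build_rsync_command(source_dir: str, target: str,
                        excludes: Sequence[str] = ("*.log",),
                        delete: bool = False) -> List[str]:
    """Build an rsync argument list that mirrors source_dir's contents to target."""
    cmd = ["rsync", "-a"]  # -a: archive mode (recursive, keeps times and permissions)
    if delete:
        cmd.append("--delete")  # also remove files on the target that are gone locally
    for pattern in excludes:
        cmd.append(f"--exclude={pattern}")
    # Trailing slash: copy the directory's contents, not the directory itself.
    cmd.append(source_dir.rstrip("/") + "/")
    cmd.append(target)
    return cmd


def sync(source_dir: str, target: str) -> None:
    """Run the copy; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(source_dir, target), check=True)


if __name__ == "__main__":
    # Hypothetical config path and backup host, printed rather than executed.
    print(build_rsync_command("/config", "pi@ha2.example:/config"))
```

Keeping the command builder separate from the part that runs it means the arguments can be printed and checked before anything is copied.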
Also obvious, but the setup needs to be switched over and tested regularly.
Could you explain why you think Active-Active redundancy can be tricky? Is this experience of a problem?
I did not consider Active-Passive HA configuration. As it stands, my two HA servers work quite happily in Active-Active mode and if a condition changes (e.g. if I turn a smart plug on), both will show the change.
I am considering a twice daily cron job (at different times on each HA server) to keep configuration/automation/custom_components/etc files up to date automatically.
Your situation is something I really want to avoid. If services and security fail, and I might not be present, I do not want the hassle of changing hardware and addresses. In my case, the HA servers are just that. All sensors/switches are smart devices within my network(s). I should say I also run duplicated DHCP and BIND servers to run my networks so configuring addresses and domain names are not really a problem for me.
In your case, apart from attached devices, you might consider keepalived in a master/slave configuration with a common virtual IP address. That would avoid having to change the static IP addresses while still allowing access to each RPi4 separately. I do this on my RPi4 PBX servers.
Anyway, I will experiment with duplicated automations in the next few days and see if I have problems - I will have to keep a close eye on the HA logs. Nothing like trying things out and getting fingers burnt!
If it wouldn’t, you wouldn’t have opened this thread
For a start, your potential automation race issue disappears in Active-Passive.
An active-active without load-balancer is only okay-ish as long as nothing targets your HA setup.
Simple example: The mobile apps only support a single HA. How do you (or would you, if you don’t use the apps) handle that case?
I think that what you are proposing is mostly an intellectual exercise anyway (even if you do implement it), but the way I address single point of failure is by just having all devices controllable without HA. e.g. my z-wave in-wall dimmers instead of smart bulbs. The “redundancy” is in the form of the light switch.
Unless you are a 1337 systems engineer, I suspect that any backup/redundant system would induce more downtime and issues than it would remedy. My very simple setup fails… essentially never. Literally. The only time anything goes wrong is when I muck with it, and a redundant system would increase the level of mucking by orders of magnitude.
Well I have now been running simple time automations on both servers for nearly 48 hours and all seem to have worked successfully. I have also set up a sensor based automation on both and it doesn’t seem to have a problem. As you will be aware, user device states are interrogated by each server and thus both show the same status (give or take a few seconds). This was always the case with manual operation of devices.
As for using a mobile app, I do not use a mobile (cell-phone, handy etc.) for anything other than speaking to people or SMS such is my distrust of the world at large! I have tried using Firefox on a 'phone to control HA but I cannot work with such a tiny screen. So monitoring/manual control is performed on an old fashioned PC (in this case I am writing this text on a 12 Y.O. W7 machine!). If I were to use a mobile app, I don’t see there would be a problem controlling/monitoring a single system while all is functioning correctly. If a single system fails, then the other ought to be accessible using a different address.
Yes, I do think my HA implementation is an ‘intellectual exercise’ but at my age, I have to keep the brain active. Every switch device I have can be physically operated but automation does have benefits for less mobile people. Strictly speaking I don’t really need HA - or come to that, a mobile phone, computer, motor vehicle, online shopping…
Not sure what a 1337 systems engineer is! Anyway, whilst duplicated systems are more complicated, the benefit is that if something goes wrong with configuration/updates/etc. while working on the first system, the other (generally) protects the domestic environment from disruption. Of course there still is the point that there will be single points of failure at sensors and devices attached (by WiFi only in my case) to HA. (My HA servers do not use WiFi).
In Summary
It seems automation works well on duplicated systems so I am happy with that!
Thanks for the input folks. My next step is to make an unburstable MQTT server!
That’s great to hear! I honestly hadn’t considered the “dead simple” option of duplicate servers. I was imagining something much more complex, with multiple nodes, load balancers, and all sorts of sophisticated systems that I don’t understand.
I didn’t mean that disparagingly. A good portion of my home assistant instance and other self-hosted endeavors are more of a hobby than things that are truly worth their “effort.” I meant that even if a redundant server is overkill, it might be fun to do anyway.
1337 => leet => elite, meant in jest.
Godspeed with the redundant mqtt servers, I hope that someone can help you out!
Hi Samuel,
In truth I had forgotten about the subject and the responses therein. Suffice to say that my duplicated HA core systems and MQTT servers have been running trouble free for the last year - apart from ‘finger trouble’ at update time! There do not seem to be any limitations to service and the MQTT servers with the keepalived daemon have performed fine.
In the last year I have added a number of devices, mainly Philips Wiz bulbs and ESP8266/ESP32 devices running the Tasmota integration plus several new automations. Again I have to remember to configure both HA systems with the same parameters.
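Remembering to keep both systems the same could be backed up by a small drift check. The sketch below hashes the YAML files under two configuration directories and lists the names that differ or exist on one side only. The function names and the `*.yaml` pattern are assumptions, and it presumes both directories are readable from one machine (for example, a copy of the other server's config pulled over beforehand).

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Sequence


def file_digests(root: Path, patterns: Sequence[str] = ("*.yaml",)) -> Dict[str, str]:
    """Map each matching file's path (relative to root) to its SHA-256 hex digest."""
    digests: Dict[str, str] = {}
    for pattern in patterns:
        for path in root.rglob(pattern):
            if path.is_file():
                name = path.relative_to(root).as_posix()
                digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def config_drift(first: Dict[str, str], second: Dict[str, str]) -> List[str]:
    """Return sorted names whose digests differ, or that exist on only one side."""
    names = first.keys() | second.keys()
    return sorted(name for name in names if first.get(name) != second.get(name))


if __name__ == "__main__":
    # Hypothetical directories holding the two servers' configurations.
    ha1 = file_digests(Path("/srv/ha1-config"))
    ha2 = file_digests(Path("/srv/ha2-config"))
    for name in config_drift(ha1, ha2):
        print("differs:", name)
```

An empty result would suggest the two YAML configurations match; anything set up through the UI rather than YAML would not be covered by this pattern.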
I have successfully added an RTL_433 integration which monitors my central heating oil tank level - and also reads the signals that some neighbours’ weather stations radiate. Had to filter out the tyre monitors of just about every passing car though.
So yes I’m happy with things as they are. I have not had a lot of time to devote to HA lately due to my old (12 Y.O.) faithful Windows 7 control centre machine failing and subsequent conversion to a fully Linux home.