I have finally made the transition from HA Core to the latest HA OS on one of my RPi4 HA servers - mainly because upgrading Python manually is a pain. I cannot say it was a great experience, as I had to recreate groups, areas, automations and custom configurations, and re-do the interfacing with other network monitors. It is working correctly in parallel with the older HA Core so if one breaks…
My question is, can I:
a) simply duplicate the HA OS microSDXC card and search for the instances of the hostname?
b) start from scratch with a new image and plug in all the configuration manually?
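For option (a), the hostname search could be roughed out along these lines. This is only a sketch: it assumes the cloned card's config folder is mounted somewhere readable, and the function name, the 5 MB size cap and the example path and hostname in the usage note are all made up for illustration.

```python
import os


def find_hostname_references(root, hostname, max_bytes=5 * 1024 * 1024):
    """Return (path, line_number, line) for each text line mentioning hostname.

    Hypothetical helper: walks every file under root, skips anything that is
    larger than max_bytes or that is not valid UTF-8 (e.g. the database).
    """
    needle = hostname.lower()
    hits = []
    for folder, _subdirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(folder, name)
            try:
                if os.path.getsize(path) > max_bytes:
                    continue  # skip large files such as databases
                with open(path, encoding="utf-8") as handle:
                    lines = handle.read().splitlines()
            except (OSError, UnicodeDecodeError):
                continue  # unreadable or binary file
            for number, line in enumerate(lines, start=1):
                if needle in line.lower():  # case-insensitive match
                    hits.append((path, number, line.strip()))
    return hits
```

Something like `find_hostname_references("/mnt/ha-clone/config", "ha-old")` would then list each file and line to look at. It only finds plain-text mentions; anything held in the database or in binary files would still need checking by hand.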
It’s not that easy to transfer service - it’s not an active / passive cluster.
ESPECIALLY with an RPi. There would be a LOT of manual stuff to revert (once you do, you will need to manually set up dongles etc. that cannot connect to both at once).
So if it comes to that, and you would not trust someone else to do those things, you’re not getting remote service restoration.
IF you decide HA (High Availability) is necessary in this case, you need to plan it out with hypervisors that can transfer the workload, and avoid dongle-connected hardware.
So I’m back to - make a backup, transfer the backup, shut down the original, wait said time, kill the original.
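The "transfer backup" step could be scripted roughly as below. It is a sketch under assumptions, not a tested procedure: it assumes the backups show up as `.tar` files in a folder the script can read (for example a network share) and that the newest file by modification time is the one you want. The function names are invented for the example.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backups are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_latest_backup(source_dir, dest_dir, pattern="*.tar"):
    """Copy the most recently modified backup and verify the copy.

    Hypothetical helper: source_dir and dest_dir are whatever paths apply
    on your own setup.
    """
    backups = sorted(Path(source_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    if not backups:
        raise FileNotFoundError(f"no files matching {pattern} in {source_dir}")
    latest = backups[-1]
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / latest.name
    shutil.copy2(latest, target)  # copy2 also preserves the timestamps
    if sha256_of(latest) != sha256_of(target):
        raise OSError(f"checksum mismatch after copying {latest.name}")
    return target
```

Checking the hash before shutting down the original is the point of the exercise: a truncated copy is only discovered at restore time otherwise.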
HA does not have any included high-availability features, so whilst I understand your sound engineering judgement to avoid SPOFs, a cold-standby is as close as many folks manage.
Certain integrations might manage limited resilience features, but others will fail (e.g. Z-Wave has a single primary controller, whereas Matter/Thread allows several).
The main pitfall is that, without a network-layer balancer or VRRP/VIPs, the hot and cold RPi4s have different MAC addresses, and hence different IPv4 / IPv6 addresses. This causes issues for non-resilient platforms such as ESPHome and MQTT (many small microcontrollers don’t have the flash space for mDNS, so they often use a plain IPv4 address).
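Since the two boxes end up on different addresses, anything on the network that wants to know which one is alive has to probe them. A minimal sketch of such a probe, assuming the default web port 8123 and using placeholder addresses in the usage note:

```python
import socket


def is_reachable(host, port=8123, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, timed out, or unresolvable


def first_reachable(hosts, port=8123, timeout=2.0):
    """Return the first host in the list that accepts a connection, else None."""
    for host in hosts:
        if is_reachable(host, port, timeout):
            return host
    return None
```

For example, `first_reachable(["192.168.1.10", "192.168.1.11"])` with the hot box listed first would return the standby only when the primary stops answering. An open port only shows that something is listening, not that HA is healthy, and it does nothing for devices with a hard-coded broker address.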
Power switch + RAS, then manually remote in to restore a backup stored off the HA box? (Doesn’t fix hardware like radios.)
My cold-standby doubles as a pre-production test bed. Personally, I prefer 2x RPi4 after years of leading teams chasing odd bugs in resilient enterprise networking kit. VMs + snapshots are probably a better solution, but time for a cycle ride cuts into home lab sysadmin hours!
Currently my Home Assistant servers are independent but have the same times/events for automated device functions. I did try using keepalived (i.e. VRRP) on the HA Core servers but that was not successful. Keepalived works well on my RPi PBX systems but that is another story. I run my own ntpsec time servers which prevented time drift on the HA Core. Single points of failure are, of course, the devices such as smart plugs or lights (I like to keep spares and they need to be configured.)
Overall, I am thinking that a new build of HA OS is the best option, as the configuration.yaml, groups, areas and automations can be manually copied. I will just have to make sure I don’t make too many mistakes!
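To catch copying mistakes, the YAML files on the old and new builds could be compared with something like the sketch below. It assumes both config folders are reachable from one machine (e.g. copied to a laptop), and the function name is invented for the example. As far as I know, UI-managed items such as areas are stored outside the YAML files, so this only covers the YAML side.

```python
from pathlib import Path


def compare_config_dirs(old_dir, new_dir, pattern="*.yaml"):
    """Report YAML files that are missing, extra, or different between builds.

    Hypothetical helper: compares file contents byte for byte, so even a
    whitespace change counts as "changed".
    """
    old_root, new_root = Path(old_dir), Path(new_dir)
    old_files = {p.relative_to(old_root) for p in old_root.rglob(pattern) if p.is_file()}
    new_files = {p.relative_to(new_root) for p in new_root.rglob(pattern) if p.is_file()}
    changed = sorted(
        rel.as_posix()
        for rel in old_files & new_files
        if (old_root / rel).read_bytes() != (new_root / rel).read_bytes()
    )
    return {
        "missing_in_new": sorted(rel.as_posix() for rel in old_files - new_files),
        "only_in_new": sorted(rel.as_posix() for rel in new_files - old_files),
        "changed": changed,
    }
```

Anything listed under `changed` can then be looked at with an ordinary diff tool; intentional differences such as a new hostname will show up there too.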