Finally cracked a method to have a hot-start backup instance of HA

I’ve searched this forum and other Internet sources for a way to run two identical HA instances in a fail-over configuration. The original goal was to have shared resources, but that proved too hard, so in the end I settled for a pretty close second-best method.

The short version:

My main instance runs in a VM on Proxmox, hosted on a Ryzen-based NUC. Because storage on the NUC is limited, my Proxmox backups are sent to my TrueNAS server over an NFS share.
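For anyone wanting to replicate this, the NFS target is just an ordinary Proxmox storage entry. A rough sketch of the relevant /etc/pve/storage.cfg stanza (storage name, export path and server IP are placeholders, not my actual values):

```
# /etc/pve/storage.cfg on the main Proxmox host -- example values only
nfs: truenas-backup
        export /mnt/tank/proxmox-backups
        path /mnt/pve/truenas-backup
        server 192.168.1.50
        content backup
        prune-backups keep-last=7
```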

My alternate host is a second-hand HPE DL380P Gen 8 server that I picked up for a song. It came with 25 SAS drives, which I used for my second TrueNAS server. I added 4 x 1TB SSDs in a ZFS pool, which hosts my second Proxmox instance. This Proxmox instance runs one HA VM whose only job is to monitor various status aspects of my production HA instance, ready to trap any failures and initiate a response depending on what has failed.

The error trapping includes pinging the first Proxmox host and the production HA VM, plus an MQTT loopback test: the error-trapping HA instance publishes to a topic that is looped back via the production HA instance, and the error-trapping instance subscribes to the returned topic. In a similar manner, the error-trapping instance also watches the MQTT status of a light hosted by the production HA instance, simply by subscribing to the light/tele/STATUS topic of that light. If the MQTT server on the production instance fails, the testing instance senses the disconnection immediately and initiates action.
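My actual implementation is a set of Node-RED flows, but the checks boil down to something like the shell sketch below. IPs and topics are placeholders, and the loopback assumes an automation on the production HA that republishes watchdog/ping as watchdog/pong (so a pong proves HA itself is processing events, not just that the broker is up):

```
#!/bin/bash
# Watchdog checks, sketched in shell. IPs and topics are examples.
PROD_PVE=192.168.1.10   # production Proxmox host
PROD_HA=192.168.1.11    # production HA VM (also runs the MQTT broker)

# Reachability tests
ping -c1 -W2 "$PROD_PVE" >/dev/null && host_ping=up || host_ping=down
ping -c1 -W2 "$PROD_HA"  >/dev/null && vm_ping=up   || vm_ping=down

# MQTT loopback: subscribe for the echoed message first, then publish.
# An automation on the production HA republishes watchdog/ping -> watchdog/pong.
mosquitto_sub -h "$PROD_HA" -t watchdog/pong -C 1 -W 5 >/dev/null &
sub_pid=$!
sleep 1
mosquitto_pub -h "$PROD_HA" -t watchdog/ping -m "$(date +%s)"
wait "$sub_pid" && mqtt_loop=up || mqtt_loop=down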

All of these tests feed a matrix that determines whether the production HA supervisor, the production HA VM, or the host Proxmox server has failed, and depending on the matrix output I attempt different remedies (see the sketch below). The error trapping also checks that the server actually has power, and builds in delays and re-checks so that a production HA instance that is merely restarting isn’t mistaken for a failure.
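The matrix itself reduces to a handful of combinations. Continuing the shell sketch above (the diagnoses and actions are simplified; the real flow also checks power state and retries before committing to an action):

```
# Map test results to a diagnosis. Order matters: a dead host also
# fails the VM ping and the loopback, so test the host case first.
case "$host_ping,$vm_ping,$mqtt_loop" in
  down,*,*)   action="restore-on-failover" ;;  # whole Proxmox host gone
  up,down,*)  action="restart-vm" ;;           # host up, HA VM down
  up,up,down) action="restart-supervisor" ;;   # VM up, HA not processing events
  up,up,up)   action="none" ;;                 # all healthy
esac
```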

I use SSH to connect when initiating the actions. If restarting the supervisor or restarting the VM fails to fix things, or if the production Proxmox host itself has failed, I SSH into the Proxmox host of my failover server and use “qmrestore” to restore from the same backup on the NFS share (which is updated every night at 3AM). A single command starts the restore, nominates the VM number and starts the VM when the restore is finished. The command is sent using the “Exec” node in Node-RED; no scripts are required.
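Roughly what my Exec node sends (hostname, VMID and backup path are placeholders; the ls -t picks the newest nightly backup file):

```
# One shot: restore the newest backup of VM 100 on the failover host,
# overwrite any stale copy (--force 1 allows overwriting an existing VMID),
# then boot it.
ssh root@failover-pve \
  'qmrestore "$(ls -t /mnt/pve/truenas-backup/dump/vzdump-qemu-100-*.vma.zst | head -n1)" \
   100 --force 1 && qm start 100'
```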

Lastly, my testing HA instance checks the status of the “new” HA instance and, once all is OK, shuts itself down so that it doesn’t go on monitoring the new instance (which could result in an endless loop).
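That final check can be as simple as polling the restored instance’s REST API until it answers, then powering the watchdog off. URL and token are placeholders ($TOKEN would be a long-lived HA access token):

```
# Wait until the restored HA answers /api/ (returns 200 once it's up),
# then shut this watchdog VM down so it can't act on the new instance.
until curl -sf -H "Authorization: Bearer $TOKEN" \
      "http://192.168.1.11:8123/api/" >/dev/null; do
  sleep 10
done
poweroff
```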

This method works perfectly with a total downtime of just over 10 minutes, most of which is taken up with the extraction of the backup file.

If there is any interest in this method I will make the effort to create a detailed guide.

Cheers.