First thought is, what is the automation in question and what (config server connection) is it relying on (that is presumably going to sleep)?
I have an HA Blue, on which I run Node-RED as an addon, as my production machine. This has the HA websocket/API nodes all running nicely using the ‘default - Node-RED as addon’ HA server configuration. This is my main machine, and runs a bunch of Node-RED flows most of which are doing something (monitoring more than automation) every few seconds.
I have a Raspberry Pi, on which I run Node-RED for development. This has the HA websocket nodes all running nicely using the 'non-default' manual setup for an HA server config, tied back to the HA Blue IP. I do not have the 'Enable Heartbeat' option set, but I do have a few ongoing monitoring and diagnostic flows, mostly listening for HA entity changes.
The Pi has been running without a reboot for several weeks/months consecutively, and I mostly have no issues with it at all, even though it is running headless and connected via WiFi.
I have recently purchased a second HA machine (odroid N2). This is now set up with Node-RED as an addon, and has one HA websocket server running for the local HA [machine 2] using the default ‘as addon’, and another HA websocket server running against the original HA [machine 1] using manual setup, again tied back using the HA1 IP address.
The second machine [2] is being used to run a MariaDB database for long-term data collection from machine [1], which is done by listening to machine 1 entity changes (via server 1) and then posting to MariaDB (via server 2).
Before launching with machine 2, I ran some basic flows just to monitor entity changes on machine 1. At first I had a number of issues, and found Node-RED regularly crashing due to running out of memory. Given that this was a new machine, with almost nothing running, I put this down to my use of WiFi to connect machine 2 to my local network (I was short on ports on my study switch). Moving the machine next to the router, where I had a spare ethernet port, gave it a sound network connection: problem solved.
As a cross-tie for monitoring, I am now updating an entity (every 10 seconds) on machine 1 using a Node-RED flow on machine 2, which allows machine 1 to know that machine 2 is running. Machine 2, of course, already knows that machine 1 is running as it is listening to entity state changes. The Rasp Pi just keeps an ear out on machine 1, but I have not yet told the Pi about machine 2, although I do wonder how long it will be before the three of them get organised and become sentient.
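The cross-tie heartbeat above can be sketched as a small watchdog: machine 2 updates an entity on machine 1 every 10 seconds, and machine 1 treats machine 2 as down if no update arrives within a tolerance window. This is a minimal illustration in plain JavaScript; the class and method names are my own invention, not from any actual flow.

```javascript
// Minimal heartbeat-watchdog sketch (illustrative names, not a real API):
// beat() is called whenever the heartbeat entity updates; isAlive() is
// polled from a monitoring loop and goes false once updates stop arriving.
class HeartbeatWatchdog {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;   // e.g. 25000 for a 10s heartbeat with slack
    this.lastSeen = null;
  }
  beat(now = Date.now()) {        // call on each entity state change
    this.lastSeen = now;
  }
  isAlive(now = Date.now()) {     // poll periodically from a monitor flow
    return this.lastSeen !== null && (now - this.lastSeen) <= this.timeoutMs;
  }
}
```

In a Node-RED function node the same idea would use flow/context storage for `lastSeen` rather than a class instance, but the logic is identical.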
So, all three of my Node-REDs can connect to one or both of the two HAs, and a short time-loop monitoring flow will tell me if anything stops working. As part of my [1] flows, I am using Modbus (over TCP) to connect to my solar inverter. This uses its own TCP server connection, which quite often goes to sleep. I have several Modbus nodes, all using the same config. The main flow runs a 20-second loop to read most of the registers, so this connection seems to stay alive without issue. Another flow interjects only when the inverter timestamp has drifted by more than 20 seconds, triggering a time-update write, which runs roughly once every three days. This connection can be problematic, as can the control-command Modbus flow, which attempts to switch inverter modes. Since these only run occasionally, I have resorted to a 'read-first' approach to wake the connection up before the critical write: I perform a read, then a second read if the first read has failed, before doing the critical write, all of which I find is necessary from time to time.
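The 'read-first' pattern can be sketched as follows. Here `readRegister` and `writeRegister` stand in for whatever Modbus client calls are actually in use; they are assumptions for illustration, not a real library API.

```javascript
// 'Read-first' wake-up sketch: do a read (and a second read if the first
// fails) before the critical write, so a sleepy Modbus TCP connection is
// awake when the write goes out. readRegister/writeRegister are assumed
// async functions supplied by the caller, not a specific Modbus library.
async function wakeThenWrite(readRegister, writeRegister, addr, value) {
  try {
    await readRegister(addr);          // first read may fail on a dormant connection
  } catch (e) {
    await readRegister(addr);          // second read: connection usually awake by now
  }
  return writeRegister(addr, value);   // the critical write, on a warmed-up connection
}
```

In a real flow each step would also want its own timeout and error branch, but this shows the ordering that makes the occasional write reliable.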
I read the Modbus registers on the inverter in several blocks, and monitor each block's read/return using a finite state machine. If any read does not get a reply within 5 seconds, a timer moves the FSM to an error state, and this event is then recorded in a circular buffer. This lets me track the frequency of read failures; a complete failure (nothing seen for 20 seconds) triggers an email alert.
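The per-block monitor above can be sketched as a small state machine with a ring buffer. This is a simplified illustration, assuming a periodic `tick()` in place of Node-RED's trigger/timer nodes; buffer size and names are my own choices.

```javascript
// Sketch of the per-block read monitor: each read gets a 5-second window;
// on timeout the FSM drops to an error state and the failure timestamp is
// recorded in a fixed-size circular buffer for later frequency analysis.
class ReadMonitor {
  constructor(bufferSize = 32, timeoutMs = 5000) {
    this.timeoutMs = timeoutMs;
    this.failures = new Array(bufferSize).fill(null);  // circular buffer of failure times
    this.next = 0;                                     // next slot to overwrite
    this.state = "idle";                               // idle -> waiting -> idle | error
    this.sentAt = null;
  }
  readSent(now) {         // a block read has just been issued
    this.state = "waiting";
    this.sentAt = now;
  }
  replyReceived() {       // the block read returned in time
    this.state = "idle";
    this.sentAt = null;
  }
  tick(now) {             // call periodically, e.g. once a second
    if (this.state === "waiting" && now - this.sentAt > this.timeoutMs) {
      this.state = "error";
      this.failures[this.next] = now;                  // log the failure
      this.next = (this.next + 1) % this.failures.length;
    }
  }
  failureCount() {
    return this.failures.filter((t) => t !== null).length;
  }
}
```

A separate 20-second watchdog over *all* blocks (as in the heartbeat idea earlier) would then decide when to send the email.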
API calls (Modbus nodes) are one-time and uni-directional, so failure is very much a case of ‘no response’.
Websockets (HA nodes) are permanent and bi-directional, so failure can be better monitored and recovered from.
If your automation failure is down to a machine-to-machine connection going to sleep then, for myself, I believe prevention is better than cure (keep the connection talking regularly) and tight monitoring is essential to know if and when a connection has failed. Once I know when the connection is failing, I can begin to consider how best to identify the cause and/or deal with the issue.