First thought is, what is the automation in question and what (config server connection) is it relying on (that is presumably going to sleep)?
I have an HA Blue, on which I run Node-RED as an addon, as my production machine. This has the HA websocket/API nodes all running nicely using the ‘default - Node-RED as addon’ HA server configuration. This is my main machine, and runs a bunch of Node-RED flows most of which are doing something (monitoring more than automation) every few seconds.
I have a Raspberry Pi, on which I run Node-RED for development. This has the HA websocket nodes all running nicely using the 'non-default' manual setup for an HA server config, tied back to the HA Blue IP. I do not have the 'Enable Heartbeat' option set, but I do have a few ongoing monitoring and diagnostic flows, mostly listening for HA entity changes.
The Pi has been running without a reboot for several weeks/months consecutively, and I mostly have no issues with it at all, even though it is running headless and connected via WiFi.
I have recently purchased a second HA machine (odroid N2). This is now set up with Node-RED as an addon, and has one HA websocket server running for the local HA [machine 2] using the default ‘as addon’, and another HA websocket server running against the original HA [machine 1] using manual setup, again tied back using the HA1 IP address.
The second machine [2] is being used to run a MariaDB database for long-term data collection from machine [1], which is done by listening to machine 1 entity changes (via server 1) and then posting to MariaDB (via server 2).
Before launching with machine 2, I ran some basic flows just to monitor entity changes on machine 1. At first I had a number of issues, and found Node-RED regularly crashing due to running out of memory. Given that this was a new machine, with almost nothing running, I put this down to my use of WiFi to connect machine 2 to my local network (I was short on ports on my study switch). Moving the machine next to the router, where I had a spare ethernet port, gave it a sound network connection: problem solved.
As a cross-tie for monitoring, I am now updating an entity (every 10 seconds) on machine 1 using a Node-RED flow on machine 2, which allows machine 1 to know that machine 2 is running. Machine 2, of course, already knows that machine 1 is running as it is listening to entity state changes. The Rasp Pi just keeps an ear out on machine 1, but I have not yet told the Pi about machine 2, although I do wonder how long it will be before the three of them get organised and become sentient.
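The cross-tie heartbeat above can be sketched as a small watchdog: machine 2 updates an entity on machine 1 every 10 seconds, and machine 1 treats machine 2 as down if no update arrives within a tolerance window. This is a minimal illustration in plain JavaScript; the class and method names are my own invention, not from any actual flow.

```javascript
// Minimal heartbeat-watchdog sketch (illustrative names, not a real API):
// beat() is called whenever the heartbeat entity updates; isAlive() is
// polled from a monitoring loop and goes false once updates stop arriving.
class HeartbeatWatchdog {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;   // e.g. 25000 for a 10s heartbeat with slack
    this.lastSeen = null;
  }
  beat(now = Date.now()) {        // call on each entity state change
    this.lastSeen = now;
  }
  isAlive(now = Date.now()) {     // poll periodically from a monitor flow
    return this.lastSeen !== null && (now - this.lastSeen) <= this.timeoutMs;
  }
}
```

In a Node-RED function node the same idea would use flow/context storage for `lastSeen` rather than a class instance, but the logic is identical.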
So, all three of my Node-REDs can connect to one or both of the two HAs, and a short time-loop monitoring flow will tell me if anything stops working. As part of my [1] flows, I am using Modbus (over TCP) to connect to my solar inverter. This uses its own TCP server connection, which quite often goes to sleep. I have several Modbus nodes, all using the same config. The main flow runs a 20-second loop to read most of the registers, so this connection seems to stay alive without issue. Another flow interjects only when the inverter timestamp has drifted by more than 20 seconds, triggering a time-update write, which runs roughly once every three days. This connection can be problematic, as can the control-command Modbus flow, which attempts to switch inverter modes. Since these only run occasionally, I have resorted to a 'read-first' approach to wake the connection up before the critical write: I perform a read, then a second read if the first read has failed, before doing the critical write, all of which I find is necessary from time to time.
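The 'read-first' pattern can be sketched as follows. Here `readRegister` and `writeRegister` stand in for whatever Modbus client calls are actually in use; they are assumptions for illustration, not a real library API.

```javascript
// 'Read-first' wake-up sketch: do a read (and a second read if the first
// fails) before the critical write, so a sleepy Modbus TCP connection is
// awake when the write goes out. readRegister/writeRegister are assumed
// async functions supplied by the caller, not a specific Modbus library.
async function wakeThenWrite(readRegister, writeRegister, addr, value) {
  try {
    await readRegister(addr);          // first read may fail on a dormant connection
  } catch (e) {
    await readRegister(addr);          // second read: connection usually awake by now
  }
  return writeRegister(addr, value);   // the critical write, on a warmed-up connection
}
```

In a real flow each step would also want its own timeout and error branch, but this shows the ordering that makes the occasional write reliable.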
I read the Modbus registers on the inverter in several blocks, and monitor each block's read/return using a finite state machine. If any read does not get a reply within 5 seconds, a timer moves the FSM to an error state, and this event is then recorded in a circular buffer. This lets me track the frequency of read failures; a complete failure (nothing seen for 20 seconds) triggers an email alert.
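The per-block monitor above can be sketched as a small state machine with a ring buffer. This is a simplified illustration, assuming a periodic `tick()` in place of Node-RED's trigger/timer nodes; buffer size and names are my own choices.

```javascript
// Sketch of the per-block read monitor: each read gets a 5-second window;
// on timeout the FSM drops to an error state and the failure timestamp is
// recorded in a fixed-size circular buffer for later frequency analysis.
class ReadMonitor {
  constructor(bufferSize = 32, timeoutMs = 5000) {
    this.timeoutMs = timeoutMs;
    this.failures = new Array(bufferSize).fill(null);  // circular buffer of failure times
    this.next = 0;                                     // next slot to overwrite
    this.state = "idle";                               // idle -> waiting -> idle | error
    this.sentAt = null;
  }
  readSent(now) {         // a block read has just been issued
    this.state = "waiting";
    this.sentAt = now;
  }
  replyReceived() {       // the block read returned in time
    this.state = "idle";
    this.sentAt = null;
  }
  tick(now) {             // call periodically, e.g. once a second
    if (this.state === "waiting" && now - this.sentAt > this.timeoutMs) {
      this.state = "error";
      this.failures[this.next] = now;                  // log the failure
      this.next = (this.next + 1) % this.failures.length;
    }
  }
  failureCount() {
    return this.failures.filter((t) => t !== null).length;
  }
}
```

A separate 20-second watchdog over *all* blocks (as in the heartbeat idea earlier) would then decide when to send the email.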
API calls (Modbus nodes) are one-time and uni-directional, so failure is very much a case of ‘no response’.
Websockets (HA nodes) are permanent and bi-directional, so failure can be better monitored and recovered from.
If your automation failure is down to a machine-to-machine connection going to sleep then, for myself, I believe prevention is better than cure (keep the connection talking regularly) and tight monitoring is essential to know if and when a connection has failed. Once I know when the connection is failing, I can begin to consider how best to identify the cause and/or deal with the issue.