Z-Wave Mesh unresponsive after 4 days - Hail received from node, but still busy with previous one

wvella · June 9, 2022, 7:44am

I’m running an otherwise stable Z-Wave network with ~30 Z-Wave devices in a ZwaveJS2Mqtt (version 6.11.0) docker container running on a Synology NAS. I’m experiencing a strange issue which occurs after 3-4 stable days where all my Z-Wave devices don’t respond to any commands like turning on a switch. Temperature readings from the 4 Z-Wave Sensors are not affected and continue to be logged. The only alert I can see in the log is a Hail received from node, but still busy with previous one... note in the log which feels like the Controller is hung on a previous Hail?

The only way to get it working again is to restart the container and effectively restart the Z-Wave network.

Any ideas?

RickKramer · June 9, 2022, 8:48am

What device(s) have you configured with Hail and why? It creates a lot of overhead on the network.

Hail = device notifies hub that something happened, hub polls, device gives status
Basic = device sends status

wvella · June 9, 2022, 8:34pm

Oh nice one, thanks! I checked and had a few devices that were set to Hail for Association Group 1 for some reason. I’ve set them all to Basic CC Report. Let’s see how it goes over the next few days.

wvella · June 24, 2022, 12:13pm

So this has eliminated the Hail received from node, but still busy with previous one... error, but I’m still experiencing the unresponsiveness. I suspect it’s one Z-Wave node that sometimes becomes unresponsive (it’s outside in an electrical box) and the controller is now getting stuck on CNTRLR » [Node 004] pinging the node....

The node becomes responsive after a manual restart of Zwave To MQTT. Is there any way I can configure a timeout or prevent the controller from getting stuck in this pinging state? In this state, the controller does not respond to any other Z-Wave commands and has to be restarted.

firstof9 · June 24, 2022, 1:28pm

No, that’s not how the controllers work. Even when pinging a node they’ll still process requests.

wvella · June 27, 2022, 10:49am

So this is getting really strange… My Z-Wave network works perfectly until 1 device in the network doesn’t respond anymore. When this happens, the Z-Wave stick no longer responds to any commands for any devices; it does, however, continue to receive temperature readings from 4 multi-sensors. Once I restart the Z-Wave Docker container, everything works again and the cycle begins.

I’m starting to think it’s a hardware issue with the stick itself? (It’s 4+ years old so might be a good idea to replace it with a new one?).

PeteRage · June 27, 2022, 12:55pm

Do you have the stick on a USB extension and hub, away from your computer?

firstof9 · June 27, 2022, 1:13pm

Unlikely, provide logs.

wvella · July 2, 2022, 10:10pm

Yeah, I have the stick on a USB extension on top of a server rack cabinet at least 1 m away from the NAS.

wvella · July 2, 2022, 10:18pm

This is the last error that was logged before the whole network became unresponsive:

13:45:27.104 CNTRLR   [Node 021] did not respond after 1/3 attempts. Scheduling next try in 500 ms.
13:45:28.616 CNTRLR   Failed to execute controller command after 1/3 attempts. Scheduling next try i
                      n 100 ms.
13:45:29.719 CNTRLR   Failed to execute controller command after 2/3 attempts. Scheduling next try i
                      n 1100 ms.
13:45:31.825 CNTRLR   [Node 021] did not respond after 2/3 attempts. Scheduling next try in 500 ms.
13:45:33.336 CNTRLR   Failed to execute controller command after 1/3 attempts. Scheduling next try i
                      n 100 ms.
13:45:34.441 CNTRLR   Failed to execute controller command after 2/3 attempts. Scheduling next try i
                      n 1100 ms.
2022-07-02 13:45:36.563 ERROR ZWAVE-SERVER: Timeout while waiting for an ACK from the controller (ZW0200)
ZWaveError: Timeout while waiting for an ACK from the controller (ZW0200)
    at Driver.sendMessage (/usr/src/app/node_modules/zwave-js/src/lib/driver/Driver.ts:3525:23)
    at Driver.sendCommand (/usr/src/app/node_modules/zwave-js/src/lib/driver/Driver.ts:3699:28)
    at BinarySwitchCCAPI.set (/usr/src/app/node_modules/zwave-js/src/lib/commandclass/BinarySwitchCC.ts:109:21)
    at Proxy.BinarySwitchCCAPI.<computed> (/usr/src/app/node_modules/zwave-js/src/lib/commandclass/BinarySwitchCC.ts:124:14)
    at ZWaveNode.setValue (/usr/src/app/node_modules/zwave-js/src/lib/node/Node.ts:854:14)
    at Function.handle (/usr/src/app/node_modules/@zwave-js/server/dist/lib/node/message_handler.js:23:44)
    at Object.node (/usr/src/app/node_modules/@zwave-js/server/dist/lib/server.js:40:91)
    at Client.receiveMessage (/usr/src/app/node_modules/@zwave-js/server/dist/lib/server.js:96:99)
    at WebSocket.<anonymous> (/usr/src/app/node_modules/@zwave-js/server/dist/lib/server.js:49:45)
    at WebSocket.emit (node:events:394:28)

Since this error, I’m still receiving updates like this:

2022-07-02 22:07:15.590 INFO **ZWAVE**: Node 12: value updated: 49-0-Humidity 74 => 73

but all commands to all devices are frozen again. When I try to ping a healthy node, it just hangs on this:

22:09:33.251 CNTRLR » [Node 007] pinging the node...

And if I try to turn on a light or a switch, no event or activity is logged in the log file. It feels like it’s hitting an unrecoverable error when a device is not reachable and the listener that receives the commands terminates?

If I restart the ZWave to MQTT docker container, it all comes back and all devices start responding again - until it hits this issue, which is happening daily now.
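Until the root cause is found, the manual restart could be automated. The following is only a minimal sketch of such a watchdog, assuming the driver log is written to a file and that the ACK timeout message quoted above is a reliable sign of the stuck state. The function names, the log path, and the container name are hypothetical.

```shell
#!/bin/sh
# Hypothetical watchdog sketch. It looks for the ACK timeout from the log
# excerpt above and restarts the container if it is found.
# Assumption: the driver log is available as a plain file on the host.

ACK_TIMEOUT_PATTERN='Timeout while waiting for an ACK from the controller'

# Returns 0 if the last N lines (default 200) of the log contain the timeout.
log_has_ack_timeout() {
    log_file="$1"
    lines="${2:-200}"
    tail -n "$lines" "$log_file" 2>/dev/null | grep -q "$ACK_TIMEOUT_PATTERN"
}

# Restarts the named container only when the timeout is present in the log.
restart_if_stuck() {
    log_file="$1"
    container="$2"
    if log_has_ack_timeout "$log_file"; then
        echo "ACK timeout found, restarting $container"
        docker restart "$container"
    fi
}
```

It could be called from a scheduled task as `restart_if_stuck /path/to/zwavejs.log zwavejs2mqtt`, where both arguments are placeholders. As written it would restart again on every run while the old error is still inside the inspected window, so a real version would need to remember which error it already handled. It treats the symptom only.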

PeteRage · July 2, 2022, 11:23pm

I think that looks like a problem with your stick or USB subsystem? And the exception is not well handled and then is hosing something upstream, so it becomes an unrecoverable failure (that is a bug, but not the root cause).

The issue starts with a binary set coming from HA going to a Z-Wave device.

You’ve done a complete hardware power reset on the whole thing?

I had a similar issue running Docker on a NAS that had been running for a long time without a reboot.

firstof9 · July 3, 2022, 1:28am

Looks like your synology is dropping the USB passthru to your container.

Fix that and your problems will go away.
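One rough way to test this theory is to check whether the serial device is still visible when the network hangs. This is a sketch under assumptions: the function names are made up for illustration, and `/dev/zwave` is the device path used in this thread.

```shell
#!/bin/sh
# Sketch: report whether a path is currently a character device.
# Function names are hypothetical; the paths passed in are up to the caller.

device_present() {
    # test -c succeeds for a character device and follows symlinks.
    [ -c "$1" ]
}

report_device() {
    if device_present "$1"; then
        echo "$1: present"
    else
        echo "$1: missing"
    fi
}
```

On the NAS this would be `report_device /dev/zwave`. Inside the container, something like `docker exec zwavejs2mqtt test -c /dev/zwave` does the same check, assuming the container is named `zwavejs2mqtt` and its image ships a `test` binary. A node that exists can still be stale, so a "present" result does not rule out a dropped passthrough.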

PeteRage · July 3, 2022, 1:01pm

Here’s a link to the issue I raised with zwave-js to get an interpretation of the first error messages.

I still get 1-3 of these a day, but it rarely goes beyond try #1. I was getting a lot of these, but they never failed on try 3 like yours. Power cycling the NAS and the USB hub and swapping the USB cable reduced it a lot.

I have two systems with the same Z-Wave stick, the same USB hub, and different Synology NAS units; both exhibit the same behavior. So in my system I’ve filed this away as “normal behavior”.

You running VM or Docker?

wvella · July 6, 2022, 2:53am

Thanks all for the hints, we’ve made some progress on this issue! To isolate the issue, I’ve reconfigured the zwavejs2mqtt docker container to use /dev/ttyACM0 device instead of /dev/zwave. This has completely resolved the issue of dropouts that I was experiencing and so far, no commands have been dropped out of a total 1197 messages.
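For anyone comparing the two device paths, it can help to see what the alias actually resolves to at a given moment. A minimal sketch, assuming `readlink -f` is available on the NAS; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print the real device node behind an alias such as /dev/zwave.
# The function name is hypothetical.

resolve_device() {
    # readlink -f prints the canonical path, following every symlink.
    readlink -f "$1"
}
```

Running `resolve_device /dev/zwave` while things work and again after a dropout shows whether the alias still points at the same node. If the target changes, for example from /dev/ttyACM0 to another ttyACM number, that would suggest the stick was re-enumerated, though this alone does not explain the dropouts.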

The question now is - Why am I experiencing this issue when using the symlink alias name? This is the process I used to create the symlink:

Using cat /proc/bus/usb/devices, find the Vendor and Product ID of the AEOTEC Z-Wave USB Stick. They will largely be the same across the Gen 5 fleet of USB sticks.
In /lib/udev/rules.d create a new file, such as 50-usb-zwave.rules.
Add the following content:
SUBSYSTEM=="tty", ATTRS{idVendor}=="0658", ATTRS{idProduct}=="0200", SYMLINK+="zwave"

NOTE: I spent a considerable amount of time troubleshooting the udev file above and couldn’t work out why it wasn’t working. It ended up being a copy-and-paste error and the way the quotes were copied.

Run udevadm control --reload-rules && udevadm trigger to reload the udev files (no need to restart the NAS).
Create a scheduled task when the NAS starts by creating the following file: /usr/local/etc/rc.d/openhab-zwave-usbpermissions.sh
Add the following content:

#!/bin/sh
# Give the openhab user and group ownership of the Z-Wave stick's serial device.
chown -R openhab:openhab /dev/ttyACM0
exit 0
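The copy-and-paste quoting problem mentioned in the NOTE above can be checked mechanically. This is a small sketch that flags typographic (curly) quotes in a rules file; the function name is hypothetical, and the file name is the example one from the steps above.

```shell
#!/bin/sh
# Sketch: list lines of a file that contain typographic quotes, which udev
# does not treat as string delimiters. The function name is hypothetical.

find_curly_quotes() {
    # -n prints line numbers; the exit status is 0 only if a match is found.
    grep -n -e '“' -e '”' -e '‘' -e '’' "$1"
}
```

For example, `find_curly_quotes /lib/udev/rules.d/50-usb-zwave.rules` prints nothing when the file only contains plain ASCII quotes, and prints each offending line with its line number otherwise.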

PeteRage · July 6, 2022, 12:30pm

On my Synology running 6.x, I was unable to get the rules to work. I just use ttyACM0 like you are; it has been working fine for multiple years.