Race condition interference in Z-Wave

I have a lot of Z-Wave devices. When something triggers a Z-Wave action like a scene change, many actions appear to run in quick succession, possibly some of them in parallel. This seems to cause some of my Z-Wave devices to go offline occasionally. The only way I have found to clear them is to disconnect power from the Z-Wave device, which usually involves a circuit breaker. This is, needless to say, annoying. Maybe simple serialization of the operations, to avoid multiple operations running simultaneously, could solve it. This is just a theory, but it seems to fit the conditions under which the problem occurs.

The issue with ZWave is that, unlike Zigbee, the ZWave controller has full control of the network. One of the features built into ZWave to help keep the network speedy is a list of dead nodes. When the controller marks a node as dead, it stays dead - the controller will ignore any further requests to communicate with the node, because otherwise the network would slow down significantly while the controller kept attempting to talk to the dead node.

The downside is that the only way to bring a not-really-dead node back to life after this happens is to unplug the USB stick and leave it powered down for a minute before plugging it back in.

The node sending a message back to the controller (for example, after you power the node down and back up) will temporarily mark the node as alive again on the controller, but the controller usually does not reset its count of failed communication attempts. This means everything works great until the next time the node fails to acknowledge a single transmission - then the controller moves it straight back onto the dead node list.

It doesn’t matter what software you use to speak to the controller; the software has no control over this behaviour. To partially mitigate the issue, the software typically keeps its own dead node list, known as a soft list. The software gives up on a node sooner than the controller does, with the aim of keeping it off the controller’s dead node list. The obvious downside is that the node will appear to fail sooner, but the advantage is that when the node sends some data, the software marks it alive again - and because it never got added to the controller’s dead list, it won’t immediately fail again on the next missed acknowledgement.
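The hard-list/soft-list behaviour described above can be sketched in a few lines of Python. Everything here is illustrative - the names, thresholds, and counter logic are assumptions standing in for whatever the real controller firmware and driver actually do, not any real API:

```python
# Illustrative model (not real Z-Wave code): the controller's "hard" dead
# list versus the driver software's more forgiving "soft" list.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.failures = 0          # consecutive missed ACKs (hypothetical counter)

CONTROLLER_DEAD_AFTER = 3          # controller gives up after this many failures
SOFT_DEAD_AFTER = 2                # software deliberately backs off sooner

def on_tx_failed(node):
    node.failures += 1

def controller_considers_dead(node):
    # Once past the threshold, the controller stops trying entirely.
    return node.failures >= CONTROLLER_DEAD_AFTER

def software_considers_dead(node):
    return node.failures >= SOFT_DEAD_AFTER

def on_rx_from_node(node, reset_counter):
    # A frame from the node marks it alive again. The catch described above:
    # the controller does NOT reset the failure counter, so one more missed
    # ACK puts it straight back on the dead list.
    if reset_counter:
        node.failures = 0

# Controller behaviour: alive again, but one missed ACK kills it immediately.
n = Node(5)
for _ in range(3):
    on_tx_failed(n)
assert controller_considers_dead(n)
on_rx_from_node(n, reset_counter=False)
on_tx_failed(n)
assert controller_considers_dead(n)

# Soft-list behaviour: marked dead sooner, but genuinely revived on RX.
m = Node(6)
on_tx_failed(m); on_tx_failed(m)
assert software_considers_dead(m) and not controller_considers_dead(m)
on_rx_from_node(m, reset_counter=True)
assert not software_considers_dead(m)
```

The key design difference is only in `on_rx_from_node`: the soft list zeroes the failure counter, so a revived node gets a genuinely fresh start.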

So that’s useful information. If I just disable my Z-Wave USB dongle by pulling it out and plugging it back in after a moment, that will reset the controller’s state. So if it really is the dead list that’s causing the problem, that should resuscitate devices that weren’t actually dead, correct?


One other point I wanted to make: I chose Z-Wave for some of these devices because the open source solutions for Wi-Fi based plugs and lights don’t work unless you’re willing to take them apart and reprogram their firmware. I have done that many times, but it’s a pain. Z-Wave doesn’t have that problem, because it’s a standard that devices just have to adhere to. Unfortunately it’s an ancient standard, and it has bugs, as you’re pointing out.

Yup, pretty much.
Been dealing with this issue since the Domoticz days.

I’ve also noticed that there are days that everything is working great, and days where it seems incredibly flaky - and anecdotally - I believe the issue is related to humidity levels.

I will say though that buying a load of Domitech bulbs when they were on offer some time ago significantly improved the stability of my network. I now have a ZWave Plus backbone, which is massively more stable than the non-Plus devices were, and because Plus has enhancements for neighbour discovery, it makes the rest of the mesh more stable.

Over the last year or two, however, I have been replacing my ZWave stuff with Zigbee. Devices almost never go offline, and I was pleasantly surprised at how much more responsive Zigbee is.

On ZWave, a motion sensor detecting motion, telling Home Assistant, and Home Assistant sending a command to turn a light on often meant a 2-3 second delay between me entering a room and the light turning on. With Zigbee, it’s been instant.

Wow, that’s good to know. I had no idea that Zigbee was better in any significant way. I guess I’ve ignored it because I chose the other Z a long time ago. Maybe I need to open my mind to new possibilities.

I was the same. The main thing that pushed me to give Zigbee a try, was price. I was getting fed up with how much ZWave stuff costs. ~£16-20 will get me a Zigbee LED strip controller. And it’s fantastic. The fact that it is much more responsive was just an added and unexpected bonus.

Next time this happens, try pinging the device and see if that fixes the issue. Others have had issues with dead nodes, and the solution was to set up an automation that pinged nodes as soon as they went dead.


It happened again, and I tried the ping, and it resolved the offline state of the device. So now I have a workaround at least. Maybe I should just have an automation that pings periodically.
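The ping workaround can be sketched as a minimal Python model. Here `ping()` is a hypothetical stand-in for whatever ping mechanism your Z-Wave integration exposes (a ping service or ping button, depending on your setup) - it is not a real API call:

```python
# Hypothetical sketch of the "ping the dead nodes" workaround. Nothing here
# is real zwavejs or Home Assistant code; the names are illustrative.

def ping(node_id, responders):
    # Simulated: a real ping would transmit a NOP frame and wait for the ACK.
    return node_id in responders

def revive_dead_nodes(dead_nodes, responders):
    """Ping every node currently marked dead; un-mark those that answer."""
    revived = []
    for node_id in sorted(dead_nodes):
        if ping(node_id, responders):
            dead_nodes.discard(node_id)   # back alive in the soft list
            revived.append(node_id)
    return revived

dead = {3, 7, 12}
# Nodes 3 and 12 were never really dead and answer the ping; node 7 is silent.
alive_again = revive_dead_nodes(dead, responders={3, 12})
assert alive_again == [3, 12]
assert dead == {7}
```

Run periodically (or triggered by a node going dead), this clears false positives while leaving genuinely unreachable nodes alone.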


I have a similar experience. For me it manifests as an increasing CAN count (one of the zwavejs diagnostics). As computers get faster, more traffic can be sent to the stick more quickly, which can result in cascading message collisions and then devices being marked dead.

I find that even a simple use case, like a motion sensor triggering an automation that turns a light on, causes a collision. Analysis shows that the stick has not finished dealing with the motion sensor before HA responds with the on command - the HA processing happens in microseconds (or faster). So what I’ve done in my automations is insert a small delay before the actions and between multiple actions. The end result is that the automation executes faster, because each CAN causes a 100ms delay while zwave retries. For my system a 50ms delay seems to do the trick, which makes it 50ms faster because it skips the 100ms retry delay.
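The timing argument above can be sketched as follows, using the post's numbers (50 ms inserted gap versus a 100 ms CAN retry). The serializer is an illustrative Python sketch, not code from zwavejs or Home Assistant:

```python
# Sketch of the workaround: serialize the actions and put a small gap
# between them so commands never hit the stick back-to-back. The 50 ms
# gap and 100 ms retry are the post's numbers, not driver constants.

import asyncio

GAP_MS = 50           # inserted delay between actions
CAN_RETRY_MS = 100    # what one collision (CAN) costs instead

async def run_actions_spaced(actions, gap_ms=GAP_MS):
    """Run the actions one at a time, pausing between them."""
    results = []
    for action in actions:
        results.append(await action())
        await asyncio.sleep(gap_ms / 1000)
    return results

async def demo():
    async def turn_on_light():
        return "light on"
    async def set_scene():
        return "scene set"
    # Tiny gap so the demo runs fast; a real setup would use GAP_MS.
    return await run_actions_spaced([turn_on_light, set_scene], gap_ms=1)

assert asyncio.run(demo()) == ["light on", "scene set"]
# Why spacing can end up *faster*: each collision avoided saves
# CAN_RETRY_MS - GAP_MS = 50 ms, as the post observes.
assert CAN_RETRY_MS - GAP_MS == 50
```

The design point is that the deliberate gap is cheaper than the retry it prevents, so the net effect is lower latency, not higher.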

Now this doesn’t solve it 100%, because a sensor reading can still arrive at the same moment and trigger the CAN retry. The other thing to look at is making sure there is no extra traffic on your zwave network. Power sensors are notorious for spamming networks. Go through everything and make sure only the required data is being delivered, at the slowest rate needed.

It would be helpful to see graphs of these zwavejs diagnostics, to see if you are getting CANs, timeouts, RX failures, etc.

What would be super helpful is if the zwavejs integration or the zwavejsui server had the ability to configure an inter-TX message delay.

Add a delay (0.5 seconds) between actions for problem nodes.
Are you sure there are no other communication issues for some nodes? If so, improving that may help.


I use scenes a lot, and I don’t see how I could add delays between actions. If my theory is correct, this is a system issue, and the system should do whatever is necessary internally to avoid the problem. If delays are needed, they should be in the Python or JS code, not in some hack I do at the user level.

Are you sure you don’t just have communication problems? How did you check this?


Not sure. I’m a computer system architect, and I work in networking. If a network has corruption or packet loss issues, it doesn’t crash - it just goes slower. Not so for Z-Wave? If that’s the case, why did anyone use it, except in ignorance of this?

Zwave is an old protocol.

Poor communication (severe packet loss) will cause a crash. I suggest you check for RX/TX problems.

You’re making a lot of assumptions about ZWave. Not all of them are correct, and they’re leading you on a goose chase, I’m afraid.

First, it’s an ULTRA low power, low speed, very low bandwidth protocol. It doesn’t take much at all to saturate the network. A well planned and maintained ZWave network can be lightning fast. A poorly performing one can be equally as craptacular.

So unlike Ethernet, one chatty device can cause a bunch of collisions and saturate the entire system. (I once had a Jasco switch, dying from click-of-death, just spam my network with updates; it took out my 90 device network instantly.)

There are things you can do to ensure good comms and prevent collisions.

  1. Choose a coordinator and ensure it’s on a supported firmware. (This is ultra important with 700 series sticks, as they are known to stall and... cause dead nodes.) Do a search for ZWave dead node 700 in this forum and you’ll find the latest threads discussing the issue and the current firmware recommendation.

In those same threads you’ll find more than one solution for automatically pinging dead nodes. This issue has been going on in one way or another for over two years. So look for current firmware and apply it. It won’t clear the issue entirely, but it’s better. I ultimately chose a coordinator based on a 500 series chip to avoid the issue entirely.

  2. Connect the coordinator to the host with a USB extension cord (or even better, a powered USB2 hub) to prevent interference from the host’s USB3 bus from causing signal issues.

  3. Use lots of repeaters. (I plan for 25 ft - about 8 m - between nodes.) Yes, the spec says 100 ft, but walls, people, dogs, cats, plants, furniture and air are all things that absorb RF signals.

  4. Join S0 devices with no security unless you need it for proper function (locks, garage doors). S0 devices require 3x the comms for each message compared to their newer S2 counterparts, and only a few chatty S0 devices can wreck a network (one chatty Schlage lock sure can...).

  5. If you use ZWaveJSUI instead of the bog-standard built-in zwavejs, you get a whole host of tools and utilities to help you understand your ZWave mesh and troubleshoot.

Right now what I hear are a lot of assumptions about too many calls, etc. I would venture to guess your comms are borderline for one reason or another, and it doesn’t take much to knock them down. But without logs and signal data telling us what’s actually happening on the network, it’s all guessing.

So what’s the log say about ZWave besides the dead nodes?