Why is Z-wave inclusion so @#$* hard?

And another update: about half of my nodes are now dead (plus a bunch of the “fake” nodes that appear when you try to include a device – but I can’t remove them without losing the real-but-not-currently-speaking nodes too), and I periodically see stretches of alternating “Controller status: controller is ready” and “Controller status: controller is unable to transmit” in my logs. And none of the devices seem to actually function (e.g., they have disappeared from dashboards and are uncontrollable in the “Devices” section of HA).

Help!!! (And thanks again to everyone helping me with this!)

You might consider changing the title of your thread as you clearly have a network stability problem and not an inclusion problem as the root cause.

You won’t like this advice, but send the family out for the day and shut down the network. Pull the batteries from all battery devices and kill power to the mains-powered devices, then slowly turn things back on from the controller outward by distance, watching the debug log in a dedicated window. You shouldn’t get repeated “jammed” or “unable to transmit” messages. You may have one or more devices that are causing problems. You could also have some external interference in the 900 MHz range. Without a spectrum analyzer, it will be hard to know for sure.

You could start with the battery powered devices as they aren’t routing devices. Shut them all down and observe the results. The Z-wave-JS UI has a lot of troubleshooting tools built in. In the node map, when you select a node, you’ll get a Diagnose button. Check things from the controller out for the mains powered devices. Diagnosing a battery device will mean keeping it alive, which can be painful depending on device placement.

I wouldn’t go changing things in the Z-wave internals unless I proved I could ace a Z-wave architecture class.

There are a few things I would look at.

  1. What firmware are you running on your controller?
  2. You might have a bad device that is causing interference. I had a switch that was bad and it would flood my network. The reason I was able to diagnose this issue was that I have multiple controllers, and every time the switch failed all of my networks would go down at once. I couldn’t add or remove devices or control them.
  3. To diagnose whether it was a software issue or a hardware issue, I switched to the Z-Wave PC Controller software. It’s free software provided by the people who manage the Z-Wave protocol. Using that software I had the same issue, so I was able to conclude that it was a hardware issue.

Is the extension cable you purchased for the Z-wave dongle powered? I had many of these same problems until I moved the dongle to a USB hub that had its own 5V power source. The conclusion I came to was that the mainboard wasn’t providing enough power to the dongle and thus it didn’t work properly. A simple extension cable for the dongle won’t fix this.

HTH

Thank you everyone! I really really appreciate all of the advice!

After several days of waiting for an opportunity to do some more significant testing (along the lines of what @mterry63 suggested), the network miraculously seems to have largely healed: all but a few devices have reconnected and a scan of the logs shows only about a dozen instances of the “unable to transmit” error (each lasting ~1 second) in the last 24hrs. I do not see any evidence in the log of any nodes spamming up the network. (@mterry63 I will start a new thread on the “unable to transmit” topic and stick to challenges with inclusion in this thread, thanks for the suggestion.)
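
For anyone who wants to count these episodes rather than eyeball the log, here is a minimal Python sketch. It pairs each “controller is unable to transmit” line with the next “controller is ready” line and reports how long the gap lasted. The two message strings are the ones quoted in this thread; the timestamp layout at the start of each line (ISO-8601 with milliseconds) and the function name `transmit_outages` are my assumptions, so adjust the regex to whatever your log file actually looks like.

```python
import re
from datetime import datetime

# Assumed line shape: "<ISO-8601 timestamp>[Z] <message text>".
# The timestamp format is an assumption, not taken from a real log.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{3})?)Z?\s+(.*)$"
)
JAMMED = "controller is unable to transmit"
READY = "controller is ready"


def transmit_outages(lines):
    """Return a list of (start_time, duration_seconds) per outage episode."""
    outages = []
    started = None  # timestamp of the first "unable to transmit" in an episode
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not start with a timestamp
        stamp = datetime.fromisoformat(match.group(1))
        text = match.group(2).lower()
        if JAMMED in text and started is None:
            started = stamp
        elif READY in text and started is not None:
            outages.append((started, (stamp - started).total_seconds()))
            started = None
    return outages


if __name__ == "__main__":
    sample = [
        "2024-05-01T10:00:00.000Z Controller status: controller is unable to transmit",
        "2024-05-01T10:00:01.000Z Controller status: controller is ready",
    ]
    for start, seconds in transmit_outages(sample):
        print(start.isoformat(), seconds)
```

Feeding it the lines of an exported log file (for example `transmit_outages(open(path))`) gives one entry per episode, which makes it easier to see whether the outages cluster at particular times of day.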

However, I’m still having persistent problems including new devices into the network. They seem to (mostly) either time out when trying to include, or time out when responding to the initial S2 authentication. I get lots of dummy nodes appearing for every successful inclusion. (It’s as if it tries to add the node, fails to establish S2 security, creates a node in the Z-wave JS UI list anyway, but it remains dead and often as “unknown manufacturer” / “unknown product”… and then repeats the process.)

@brianmacdonald I will try getting an external power source for the USB dongle (powered hub)
@cornellrwilliams the controller is running FW v1.20

Any other advice?

Additionally, how can I re-include the couple of devices that are still marked as dead (despite being mains-powered devices located physically within a couple feet of devices that are very much alive)? They are hard to move physically (installed light switches and sensors) and given all of the challenges including, I’m hesitant to exclude-then-reinclude…

Thanks again so much for all of the advice!

Ok another update… and the bottom line is: yes Z-wave inclusion is terrible (at least in my system).

First: I now believe that the episode where most of my devices appeared as dead is unrelated. I’ve observed that if I change the logging level (e.g., between “verbose” and “debug”), then many of my devices die (it looks like the ones that communicate with the controller via mesh rather than directly) and I start getting intermittent errors about the controller being unable to transmit. This seems to work itself out in a couple of hours to a couple of days (though every time this has happened, 1 or 2 nodes fail to reconnect and remain dead and need to be replaced). (I didn’t make the connection between changing logging level and losing nodes until it happened a second time.) This is deeply concerning but deserves a separate thread, which I will start shortly.

Now back to the challenges with inclusion: this is still miserably poor, and generally still follows the pattern described in my first post. “Smart Start” devices maybe connect after several days of trying (but often don’t even after that), and usually generate a number of “fake” nodes in the process that linger as dead and can’t be removed without a lot of collateral damage. It seems like what is happening is that the devices time out when trying to include (resulting in unsuccessful inclusion and a long wait) or time out when configuring S2 security (resulting in a “fake node” and then starting over again). Occasionally manual inclusion works, though frequently it includes without security (despite selecting “Force security” (!?)), creating a node that must be excluded… then I have to start over again.

And lately, when I try to exclude these security-less nodes, legitimate unrelated nodes get removed too(!). For example, I am trying to include a smart light switch. I ran manual inclusion but it timed out when trying to set up S2 security, resulting in a node included without security. I entered Exclusion mode in Z-wave JS UI, put the switch into exclusion mode… and then got a note that a completely unrelated smart plug had been successfully removed. (I was nowhere near the plug and it was definitely not in exclusion mode.)

I have invested so much in this rubbish-looking technology that I would really like to get it working. But I simply don’t see a path forward.

I see zero indication that I have a misbehaving node that is spamming the network: communication with non-dead nodes seems very reliable and the debug logs do not show anything suggesting that any node is transmitting too much. My only hope left is reconnecting the Z-wave dongle via a powered USB hub (in case it was being under-powered by the Home Assistant Green’s USB port), and that should be arriving later today.

Any help would be very appreciated!

Thanks!

I replied to your linked thread.

Honestly, the experience you document is so foreign to any Z-wave issues I’ve experienced that I’m not sure where to even start, other than to point out that “the controller being unable to transmit” is NOT a normal Z-wave experience.

Also, I don’t understand in any way how “inclusion” could occur after a couple of hours to a couple of days. I simply don’t believe the inclusion process will run for that long. The only thing I can imagine is that communication is so poor that it takes that long for the interview process to complete and therefore update the node list.

I’ve never had excluding a single node remove multiple nodes.

Maybe open an issue on the Z-wave-JS UI GitHub and see if anyone there can help identify your unique gremlin.

Thanks @mterry63 it really seems unlike anything I’d expect from a mature technology so I’ve been assuming I just have made some boneheaded configuration error or something… but so far nothing has turned up.

My comment about inclusion happening after a couple days was about going the “Smart Start” (i.e., QR code) route. It’s fairly common that if I try to include a new device by going (in Z-Wave JS UI) to inclusion, choosing “Scan QR code” and entering the QR code, and then going to the provisioning entries, checking only “S2 Authenticated” and then turning on “Active”… then over the next several days a number (usually 1-4/day) of fake nodes appear (always without security) and then immediately go dead. (Sometimes they disappear from Z-wave JS UI shortly afterwards, but usually they linger indefinitely.) I usually get a message in Z-Wave JS UI to the effect that “node ### has been included with security none”, sometimes also mentioning either an “unknown error” or a timeout.

Sometimes these “fake nodes” remain listed as “Unknown Manufacturer” / “Unknown product” (and forever show “ProtocolInfo” with a spinning wheel in the interview column of the Z-Wave JS UI table), but other times they show the right product/manufacturer/FW version but never get a node name or location (and forever show “NodeInfo” and a spinning wheel in the Interview column of the Z-Wave JS UI table). Then sometimes after a couple of days the “real” node happily appears. Other times it never does…

(Why my insistence on S2 security? I’ve got a number of existing nodes that were included with S2 security and I’d like to make associations. My understanding is that it’s not possible to make associations between devices using different security levels.)

Sounds more like the interview process is taking an inordinately long time. Some complex devices have a good bit of information to exchange during the interview process, but typically I see this complete in a few seconds.

Have you ever tried to update a node’s firmware? Success or failure of that would go a long way to indicating the network health.

I don’t think S2 security has anything to do with the root problem, it’s just highlighting/aggravating the symptoms. But as I said earlier, your experience is so out of whack with mine, I’m no expert in solving your problem. That’s why I recommended the GitHub route.

Oh interesting. Now that you mention it, I have had a lot of challenges updating node firmware (despite a number of attempts). Hmmm… What does that imply (other than that there are lots of problems :confused: )?

(FWIW the devices I’m trying to add are relatively straightforward, mostly Zooz 800 series switches (Zen71, Zen30, Zen32), Aeotec Multisensor 7’s, or Swidget inserts.)

Inside of Z-Wave JS UI go to the node map > click on a node > then click diagnose to perform a health check. After it’s done it will give you a bunch of information about your device communication. The most important is SNR. If you have a bunch of interference your device will have a negative SNR.

I recommend you try this on multiple devices to get a good idea of what’s going on in your network.
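
To make the “negative SNR” point concrete: signal-to-noise ratio in dB is just the received signal level minus the background noise level, both in dBm. The sketch below is only that arithmetic. The function name and the sample numbers are made up for illustration and are not taken from a real health check.

```python
def snr_margin_db(signal_dbm: float, noise_floor_dbm: float) -> float:
    """Signal level minus noise floor, in dB. Negative means noise wins."""
    return signal_dbm - noise_floor_dbm


if __name__ == "__main__":
    # Hypothetical readings, chosen only to show both signs of the result.
    quiet_channel = snr_margin_db(signal_dbm=-70.0, noise_floor_dbm=-95.0)
    noisy_channel = snr_margin_db(signal_dbm=-90.0, noise_floor_dbm=-80.0)
    print(quiet_channel, noisy_channel)
```

A positive result means the node’s signal sits above the noise; a negative one means the background is louder than the node at the point where it was measured.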

The last few devices I’ve tried to add have been a similar nightmare to OP. Constant retries of inclusion/exclusion. I’ll often get nodes that will finally include, but then take a few days for all of the device entities to show up.

@Onkage I’m (apologetically) glad to see that I’m not alone in this issue… This wasn’t how things started out but as I added more nodes things got really bad.

Here’s an update:

First, @cornellrwilliams I’ve run the “diagnose” command a number of times and have no idea how to interpret what I’m seeing:

Here’s a node 12’ away from my controller (unobstructed line-of-sight):

Here’s one 15’ away through a (wood stud + drywall) wall:

Here are 2 more:


The SNR’s are all positive. Many of these nodes are operational, but the score out of 10 does not inspire confidence… Is this good or bad? What are the takeaways?

More broadly, I’m continuing to see “unable to transmit” messages in my logs. And now it’s taking a long time (several minutes) after opening Z-wave JS UI to see any nodes listed (even my controller)… during which time the icon third-from-right at the top of Z-wave JS UI flashes back and forth between red and green (“disconnected” and “connected”).

So I took drastic steps and tried replacing my controller… and even my Home Assistant box while I was at it. So I’m using a brand new dongle (restored from backup), and a brand new Home Assistant host (HA Yellow rather than Green) based on an entirely different SoC (RPi), with different hardware… and I went ahead and moved the controller to a completely different part of the house – one with VERY little possibility of RF interference…
… and I still have the same deeply unreliable z-wave behavior: new nodes do not include (timing out, especially on S2 authentication – across at least 3 different brands of devices), and OTA updates fail as well. And every time I change the logging level, all nodes disconnect and slowly reconnect over the course of a day or two and about 10-15% never reconnect, even after days (note: highly correlated with device type – Aeotec Multisensor 7’s are the worst, Zooz ZAC38’s are 2nd worst, Zooz Zen32’s seem more ok, Zooz Zen71s are mixed.)

This is maddening and definitely isn’t giving me the warm fuzzies about Z-wave or any companies/people who promote it :frowning: . Even if I have some bizarrely weird situation going on (and I can’t imagine what that might be), the lack of diagnosability is pretty unforgivable in a technology like this. (E.g., is my network saturated and if so, by which nodes? Is it a throughput issue? Is it all a controller issue? Is there background interference? etc.?)

I would LOVE to be proven totally wrong and have someone point out that like an idiot I, e.g., forgot to enable the “make z-wave network reliable” option in the 4th settings submenu but so far that hasn’t happened…

Bit of a stretch here… but do you have more than one controller stick plugged in and active at the same time? Pictures 1 and 4 above show “USB Dongle (General)”, while 2 and 3 show “USB Dongle (4 - Mike’s Office)”.

Good idea, but sadly not the case. (These diagnoses were done over several days and I tried relocating the controller during that time (in case there was background interference) and updated its location when I did…)

Without a spectrum analyzer, the statement that the new location has very little possibility of RF interference is wishful thinking at best and leading you down the wrong path at worst. Interference in the 900 MHz band doesn’t have to originate from your house. Your troubleshooting efforts to this point help make the case for interference or a lack of a clear channel. You could swap out every component and the result would be the same, as the source of the problem is potentially external.

High latency and log errors about being unable to transmit align with a congested channel. Z-wave radios can’t transmit at the same time, otherwise they produce a “collision” which results in scrambled data. Another z-wave node, or any transmitter on the same frequency, that happens to transmit at the same time will trigger collision detection and a random back-off before trying again. This can occur over and over on a congested channel, resulting in a timeout of the overall stack.
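
A rough way to see why congestion turns into timeouts is a toy model, not the actual Z-wave back-off algorithm: assume each transmit attempt independently finds the channel busy with some fixed probability, and that the stack gives up after a fixed number of attempts. Both assumptions, and the function name `all_attempts_blocked`, are mine and are there only to illustrate the shape of the problem.

```python
def all_attempts_blocked(busy_fraction: float, attempts: int) -> float:
    """Probability that every one of `attempts` tries finds the channel busy.

    Toy model: attempts are treated as independent, which real back-off
    schemes only approximate.
    """
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return busy_fraction ** attempts


if __name__ == "__main__":
    # Compare a mostly idle channel with a heavily used one.
    for busy in (0.1, 0.5, 0.9):
        print(busy, all_attempts_blocked(busy, attempts=4))
```

The point of the sketch is that the failure rate climbs steeply as the busy fraction approaches 1, so a channel that is “usually fine” and one that “times out constantly” can differ by less background traffic than you might expect.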

The symptoms of congestion fit your descriptions. If the problem is external radio interference and not a jabbering node, there’s a chance you will never solve it.

@mterry63 Thanks. Fair point – and I really am not an expert on RF analysis!

That said, the new location I moved the controller to is a basement surrounded on 3 sides by thick earth and underneath a concrete slab. Definitely not a Faraday cage, but if the culprit really was background interference then I’d naively expect to see a different failure pattern (e.g., problems with nodes farther out in the main house but better behavior with other nodes in the basement, and hopefully fewer “failure to transmit” errors), which doesn’t seem to be the case.

I’d be happy to get a simple 900 MHz spectrum analyzer from Amazon (and it even looks like Z-Wave JS UI shows some information on the background if you click on the controller node in network view), but I don’t really know what I’m looking for – e.g., what is “normal” vs problematic readings… Any guidance?

Thank you again for the suggestions!

I’d repeat my guidance to open an issue on Github or ask for help on discord. Buying a scalpel on Amazon won’t make you a surgeon. You need the tool, the knowledge, and the experience. It’s hard to short-circuit that 3rd requirement. :slight_smile:

Old computer guy here, with some RF experience. There’s another phenomenon in RF called “desensing” which can cause all sorts of reliability problems. Basically some kind of “transmitter”, even working at a frequency far away from the one you’re using, can “desense” a receiver that is physically close by, so that the receiver can’t hear anything at all while that transmitter is active.

RF energy falls away rapidly with distance, but often people try to be “neat” and, for example, put all of their “equipment” in the same closet or shelf, where they are very close together. A transmitter can effectively block all signals that a receiver can otherwise receive just fine when that transmitter is not transmitting.
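
To put a number on “falls away rapidly with distance”, here is a small Python sketch using the standard free-space path loss formula, 20·log10(4πdf/c). It is an idealized model (open space, far field, no walls or reflections), so treat the result as a rough guide only. The 900 MHz figure is the nominal band mentioned in this thread, and the two distances are arbitrary examples.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def free_space_path_loss_db(distance_m: float, freq_hz: float) -> float:
    """Idealized free-space path loss in dB between two antennas."""
    if distance_m <= 0 or freq_hz <= 0:
        raise ValueError("distance and frequency must be positive")
    return 20.0 * math.log10(
        4.0 * math.pi * distance_m * freq_hz / SPEED_OF_LIGHT_M_S
    )


if __name__ == "__main__":
    freq = 900e6  # nominal 900 MHz band
    same_shelf = free_space_path_loss_db(0.5, freq)   # gear stacked together
    across_room = free_space_path_loss_db(5.0, freq)  # gear spread out
    # Moving 10x farther away costs the interferer 20 dB, i.e. 100x less power.
    print(round(across_room - same_shelf, 1))
```

In this model every tenfold increase in separation knocks 20 dB (a factor of 100 in power) off what the receiver sees from the unwanted transmitter, which is why simply moving boxes apart is often the cheapest fix to try.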

So many devices now use radio. In addition to the 900MHz Zwave interactions, bluetooth, wifi, “baby monitors”, cameras, appliances, etc., etc. can be occasionally transmitting on some frequency even though you’re not really using them or don’t know that they’re using radio.

Desensing isn’t limited to “radio” transmitters. Even power lines generate RF energy, depending on whether or not current is flowing. A Zwave device on a wall but with a power line hiding just behind the drywall might be affected when that power circuit is in use.

I still remember one nasty “reliability” problem I struggled with for weeks with a computer system I was building. Sometimes it worked, sometimes not. Eventually I noticed that it worked, or not, depending on whether or not the room lights were on! That was the clue that led to discovering that there was a defective fluorescent tube in the lab’s ceiling lights, sending out a strong RF signal at something like 40 kilohertz IIRC - which was enough to cause the computer problem.

In another case, I finally noticed that a problem was occurring when a plane flew overhead at a certain time of day. I never found out for sure, but I suspected some particular flight had a plane with a radar issue that caused the problem whenever it was arriving for its daily flight. Grounding everything as I should have in the first place made that problem go away.

So just FYI - a 900MHz spectrum analyzer might find the problem, but if not you might think about other RF sources. Of course the problem might be at either end of a 2-way ZWave conversation, so you have to look at the various devices along the ZWave route in addition to the controller itself. One debugging technique is to turn off things (power them down) in the area, even though they’re not “900 MHz devices”, and see if the problems disappear.

Good luck!

One more possibility I forgot to mention. USB3 is a known source of problems as well, even though it’s not “radio”. See for example: USB3.0 Radio Frequency Interference

LANs also now run at such high speeds that the wires can act as antennas and radiate RF. Gigabit Ethernet over copper signals at 125 megabaud on each wire pair, so the cable carries plenty of energy in the tens-of-MHz range and above.

Your Ethernet and USB cables can be sources of RF interference, even though they don’t look like radio devices.

Good RF engineering practice motivates keeping equipment and wiring well separated and appropriately shielded and/or grounded. Good housekeeping motivates putting all those ugly boxes and wires in one place and out of sight. Often you can’t have both…