Z-Wave JS going offline & high latency

I am seeing problems with ZUI where the whole Z-Wave network appears to be unavailable. Sometimes it’s only for a few seconds, but sometimes it’s for several minutes, as just happened. This results in things like lights not being turned on by motion sensors. This is what I saw:

The red icon on the top right alternated with green for a while, sometimes for minutes.
I would like to figure out what’s causing this. Which log should I collect, and how? And what’s the right place to get it reviewed?

I am also experiencing other issues, such as temporary dead nodes, inability to include certain devices, and a significant amount of lag - up to 5 seconds for the motion sensor signal to be received, and another 5 seconds to turn on the switch. I suspect there is interference, but am not sure how to narrow it down. The majority of the 69 devices are hardwired Zooz switches, and I can’t easily power them down - breakers work but a lot of other things will probably power down, too. I could exclude the switches temporarily, but not sure that would reduce interference if they are the cause.

I would really appreciate some help with this. My entire Z-Wave system has fallen apart, because Z-Wave JS goes into “disconnected” status so frequently, and all the devices disappear temporarily. A ton of sensor events are missed as a result. And of course, many commands don’t go through at random.

I checked the logs under Store, both for ZUI and Z-Wave JS. I was not able to correlate any entries with the time when the devices all went offline.

Tell us about your setup: what computer it’s running on, whether it’s HAOS, what kind of stick, how long the extension cable is, and whether it ever worked. Put Z-Wave JS into debug mode and post a log.

I use HAOS and ZUI in a VMware VM on a Win11 Pro host. The CPU is an AMD 5700G. The motherboard is an Asus Prime X570 Pro. The host has 32GB of RAM. The VM has 8GB of RAM and 8 cores allocated to it. It has 64GB of virtual disk, running off a SATA SSD.

I use the Zooz ZST39 800LR stick. The extension cable is 10ft.
Things worked mostly fine until about a month ago. The main problems were a few nodes going offline occasionally, mainly based on physical distance. However, I never saw the devices all simultaneously disappear until recently - I would say, in the last couple of weeks.

I wanted to look at the device inventory to post a list, but right as I did so, at 5:12am PT, the control panel once again showed no devices. It then refreshed with the full device list about a minute later.

I have:

27 Zooz ZEN76 switches
8 Zooz ZEN34 scene controllers
7 Kwikset HC620 locks
6 Zooz ZAC38 repeaters
4 Zooz ZEN77 dimmers
3 First Alert ZCOMBO smoke/CO alarms
3 Zooz ZEN15 smartplugs
2 Ecolink Tiltzwave1 tilt sensors
2 Zooz ZSE11 Q sensors
2 Zooz ZSE18 motion sensors
1 Zooz ZAC36 valve
1 Zooz ZEN16 relay
1 Zooz ZEN32 scene controller
1 Minoston MP22ZP smartplug

Everything is running in traditional Z-Wave mode, except the ZEN16, which is in LR mode.

The most recently added devices are the First Alert ZCOMBO, Kwikset HC620, and Zooz ZEN16 relay. Every other model listed above has been in use successfully for well over a year. I also added 14 more switches recently: the 10 ZEN76 and 4 ZEN77 were all installed this year. All Zooz devices have had their firmware updated to the latest version. I don’t recall seeing updates for the other brands.

All devices are running with some sort of security, except for the 2 Ecolink sensors, which have none, and are not even Z-Wave+.
The 3 ZEN15 are using S0 legacy.
Everything else is using S2 (either authenticated, access control, or unauthenticated for those devices that can only do that).

I did notice earlier that there were many power reports from the three ZEN15. I turned off all reporting on them using Z-Wave parameters. Unfortunately, that did not solve the problem of everything going offline.

I’m not certain of the best place to post a log. For now, I have put it on Google Drive. zwavejs_2025-03-12-2.log - Google Drive

The log file is 90%+ meter readings, so those really need to be decreased. Node 135 and Node 86 are good places to start. Think in terms of tens of minutes between reports, and disable measurements you don’t need at the device (current, voltage, watts). Too much data being transmitted can cause cascading failures.

Regarding the RX stats, the quickest approach is just to look at each node and write down the number, then start with the device with the highest RX count. Or use the script in this post.
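If reading each node’s statistics by hand is tedious, the tallying can also be scripted. Here is a minimal sketch (not the script from the linked post), assuming the standard zwave-js driver log format, where incoming frames are marked with a « arrow followed by [Node NNN]:

```python
import re
from collections import Counter

# Incoming (RX) frames in a zwave-js driver log are marked with a
# "«" arrow followed by "[Node NNN]". This is an assumption based on
# the default driver log format.
RX_FRAME = re.compile(r"«\s+\[Node (\d+)\]")

def count_rx(lines):
    """Return a Counter mapping node id -> number of RX frames."""
    counts = Counter()
    for line in lines:
        match = RX_FRAME.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        # Noisiest nodes first -- the reporting-interval candidates.
        for node, total in count_rx(log).most_common(10):
            print(f"Node {node:>3}: {total} RX frames")
```

Run it against the debug log and start with whichever nodes top the list.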

Yes, I know about nodes 86 and 135. These are the ZEN15s I mentioned. I already disabled all reporting from them using Z-Wave parameters 171 through 174, prior to posting the log this morning.

Here is an updated log:

Is there any specific string I can search for to find out when the controller gets disconnected and reconnected?

Or alternatively, an automation I could run to get some sort of timeline?

Just some advice: don’t use the Silly log level. It floods the logs with messages that aren’t useful. Stick to Debug.

I still see those nodes reporting. Just search for them, e.g. “[Node 135]”.

That’s really strange. Reporting on that node has been disabled for at least 12 hours. Same for nodes 86 and 24. All three are the same model, the Zooz ZEN15.

OK. I just switched it. It would be helpful to know if there is a way to look for controller disconnects. I have many old backups and would like to review the logs to find out when it started happening.

Are you referring to this?

That’s not your Z-Wave controller, that’s an indication that the websocket connection from your web browser to the ZUI HTTP server has been disconnected. What could be causing that is not so obvious. You might see if the browser developer console (press F12) shows anything. You’ve got a pretty beefy VM allocated for HAOS, but is there any reason it might be hitting some resource limits?

Thanks. I had no idea that’s what the icon signified. I thought this was about Z-wave disconnects, and that it might correlate with the communication issues. Perhaps the two have nothing to do with each other, then. I will try a different browser. My daily one is Firefox for Windows. The lag and communication problems are unfortunately very much still there. They are worse for the most distant nodes. But even an automation running in the same room as the Z-wave stick has still failed occasionally.

There is no reason I know of why the VM should be overloaded.
Looking at it manually right now, it shows <1% of CPU usage and 1.8GB of RAM usage.

I’m not aware of any long-term log of CPU time or memory in HAOS, but if there is, I’d like to turn it on.

The controller statistics are as follows.

For the commands, that’s quite a few dropped RX. What could be causing this?
Is there an expected range of errors, as a percentage of all commands, on a working system? Also, interestingly, there are no dropped TX for either commands or messages.

I’d really like to look at the RSSI for each node, but still can’t find it in ZUI.

Commands dropped RX is defined as “No. of commands from the node that were dropped by the host”. It looks like these are currently related to S2 messages that can’t be decoded. Either the message authentication has not yet been established (e.g. there was a Z-Wave JS restart and this is first time communication) or the message was corrupted. Some of these will correspond to log messages that contain cannot decode command. Of those logs, some will always occur after a restart of Z-Wave JS when it talks to a node for the first time (No SPAN is established yet). Other dropped messages will be logged with Dropping message and you’ll probably see that followed by Security2CCMessageEncapsulation error message.
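Those marker phrases can double as the timeline that was asked about above. A rough sketch that pulls the timestamp of every matching line (assuming the driver’s ISO timestamp prefix and the log phrases quoted in this reply):

```python
import re

# Marker phrases zwave-js logs for undecodable or dropped S2 frames,
# as described above. Treat these as assumptions about the log text.
MARKERS = ("cannot decode command", "Dropping message")
# The driver prefixes each line with an ISO-8601 UTC timestamp.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z)")

def drop_timeline(lines):
    """Yield (timestamp, marker) for every dropped-frame log line."""
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                match = TIMESTAMP.match(line)
                yield (match.group(1) if match else "?", marker)
                break

if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for stamp, marker in drop_timeline(log):
            print(stamp, marker)
```

Clusters of timestamps in the output should line up with the periods when commands were being lost.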

Is there an expected range of errors, as a percentage of all commands, on a working system?

Probably depends, but with an uptime of 2 weeks I have 96 dropped RX (out of 37,759 total).

I just checked why I had 96 dropped, and 76 of them were from a single node freaking out and spamming retry messages (Duplicate command), ultimately jamming the controller and causing a soft reset (which recovered). That was the first time I’ve observed a jammed controller. Excluding that, I’d have 20 dropped commands over 2 weeks, and 14 of those would be due to a Z-Wave JS restart.

I’d really like to look at the RSSI for each node, but still can’t find it in ZUI.

In ZUI the RSSI is displayed in the Network Graph on the node edges, but I just enable the sensors in HA, which gives you long-term statistics.

Thank you! This is very helpful!
I will look for these messages in my log.

The dropped RX count has more than doubled to 488 since my last post, which was only about an hour ago, when it was 200. So, almost 300 more in one hour, way more than you are getting in 2 weeks of uptime! And that is out of 922 total, so 53% of all RX commands are getting dropped!

On the plus side, I have not seen the “disconnect” icon or all the nodes disappear when using Chrome instead of Firefox. I would still prefer to use FF, having been a developer on it in a previous life.

Thanks for providing a way to turn on the RSSI. I’m going to enable the RSSI sensor on all the nodes. Is there something specific I have to do in HA to keep the values long-term?

I took a quick look at your previous logs; you have a ton of “error: Duplicate command” entries. Sometimes nodes spam them like crazy. For example:

2025-03-12T18:10:49.213Z DRIVER « [Node 087] [REQ] [BridgeApplicationCommand]
                                  │ RSSI: -91 dBm
                                  └─[Security2CCNonceGet] [INVALID]
                                      error: Duplicate command (sequence number 111)

I gave up counting the messages; there were more than 100 in a few seconds. That is definitely going to kill the performance of your network. I’m not sure what’s causing it.
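A quick way to see which nodes are spamming these is to tally the error per node. A minimal sketch, assuming the layout shown in the excerpt above (a header line carrying [Node NNN], followed by indented detail lines, one of which may hold the error):

```python
import re
from collections import Counter

# A frame header line carries "[Node NNN]"; the "Duplicate command"
# error appears on a later, indented detail line of the same frame.
# This layout is an assumption based on the excerpt above.
NODE_HEADER = re.compile(r"\[Node (\d+)\]")

def duplicate_counts(lines):
    """Count 'Duplicate command' errors per node id."""
    counts = Counter()
    current = None
    for line in lines:
        header = NODE_HEADER.search(line)
        if header:
            # Remember which node the following detail lines belong to.
            current = int(header.group(1))
        elif "Duplicate command" in line and current is not None:
            counts[current] += 1
    return counts
```

Feeding the whole log through this gives a ranked list of the worst offenders.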

Nope, you don’t have to do anything.

Thank you very much! Node 87 is a Zooz ZSE11 Q sensor, located in my master bedroom closet upstairs. I am going to remove the batteries from that sensor for now, and look for other nodes sending the duplicate message in the log, now that I know what to look for.

So, I had two of the AI chatbots write a Python script to analyze the log. I won’t post the script itself, since that’s against HA community policy, and it probably has some edge-case bugs.

$ python3 analog.py /d/Downloads/zwavejs_2025-03-12-4.log
Duplicate Command Error Report (Sorted by Error Count):
--------------------------------------------------
Node  87: 235 errors
Node 107: 233 errors
Node   2:  37 errors
Node  57:  28 errors
Node   7:  28 errors
Node   3:  12 errors
Node  75:  12 errors
Node 145:   9 errors
Node  51:   8 errors
Node 102:   7 errors
Node 165:   6 errors
Node 128:   6 errors
Node 122:   5 errors
Node  94:   4 errors
Node 121:   4 errors
Node  89:   3 errors
Node 159:   3 errors
Node 167:   3 errors
Node  59:   3 errors
Node  43:   3 errors
Node  88:   2 errors

87 is a Zooz ZSE11 in the Master bedroom closet upstairs
107 is another Zooz ZSE11 in a half bathroom upstairs
2 is a Zooz ZEN76 in the home theater
57 is a Zooz ZEN76 on the kitchen counter
7 is a Zooz ZEN76 in the home theater
3 is a Zooz ZEN76 in the home theater
Node 75 does not exist! Yet it is present in the log. I’m not sure how to track down the device in question.
145 is a First Alert ZCOMBO in my downstairs bedroom
51 is a Zooz ZAC38 range extender in the master bedroom
102 is a Zooz ZSE18 motion sensor in the office, about 4ft from the stick
165 is a Kwikset HC620 on the patio door
128 is a First Alert ZCOMBO in an upstairs bedroom
122 is a Minoston MP22ZP smartplug in the office
94 is a Zooz ZEN76 in the half bathroom
121 is a Zooz ZSE18 in the home theater
89 is a Zooz ZEN76 in the master bedroom
159 is a Kwikset HC620 in the downstairs bedroom
167 is a Zooz ZEN76 in the master bedroom closet
59 is a Zooz ZAC38 range extender in the patio
43 is a Zooz ZAC38 range extender in the garage
88 is a Zooz ZEN76 in the master bedroom

I took the batteries out of the two ZSE11 - nodes 87 and 107.
It’s somewhat surprising they were behaving so badly, especially since they were within a few feet of nodes 167 and 94 respectively, and those get only a few errors.

There are four Zooz ZEN76 that seem problematic: nodes 2, 3, 7 and 57. I’m not quite sure what to do about them. I could switch them to LR mode, but I don’t think I’ll be able to use multicast anymore with the 3 in the home theater if I do that.

In that hour you’ve had 400 RX transmissions, which is one every 9 seconds. In contrast, on my network I’ve had 600 RX transmissions in 4 hours, which is one every 24 seconds. My network has zero RX errors.

When there are too many RX at the same time, they collide with each other and cause RX errors, which then cause retransmissions, creating yet more traffic, which causes more collisions, and so on.

Keep working to reduce transmissions, by a factor of 3 or 4. Also make sure you don’t have automations doing useless operations, like turning a light on when it is already on.
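As a sanity check, the arithmetic behind the two rates compared above (the frame counts are the ones quoted in this exchange):

```python
# Rough traffic-rate arithmetic for the two networks compared above.
# The frame counts are the ones quoted in the thread.
busy_rx, busy_hours = 400, 1      # ~400 RX frames in one hour
quiet_rx, quiet_hours = 600, 4    # 600 RX frames over four hours

busy_interval = busy_hours * 3600 / busy_rx      # seconds between RX frames
quiet_interval = quiet_hours * 3600 / quiet_rx

print(f"busy network:  one RX every {busy_interval:.0f} s")   # 9 s
print(f"quiet network: one RX every {quiet_interval:.0f} s")  # 24 s
```

So the problem network is carrying frames almost three times as often as the healthy one, before counting any retransmissions the collisions trigger.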

That seems extremely low. Should we really expect our Z-Wave networks to only be able to handle one update every 30 seconds?