I have been having an issue with my Z-Wave JS UI setup for a few weeks now. Every time a device dies (which is multiple times a day due to the distance of the devices), my network dies. This is a big issue as we will be having guests staying here who need to use the Home Automation, and I want to minimise the issues they will have. Any ideas? Thanks in advance.
Add more mains-powered devices to improve the mesh. If you can’t, manually marking the remote device as dead, or excluding it, is the only other way.
The coordinator will attempt to contact a device, and depending on the coordinator firmware I’ve seen this cause a pause in traffic for about two minutes (might be 200S).
ISTR updating the firmware on an Aeotec Gen5 helped slightly - but that was several years ago.
In case this helps, see this post:
This device is spamming the network and taking it down. As suggested, add a mains-powered device or repeaters to help.
I like the Aeotec range extender
Another option if we are talking another building or super duper far. If there is excellent network access, not just “I think my WiFi can reach it”, you can add a pi with ZwavejsUI installed and run a separate network.
Thank you both @tmjpugh and @FloatingBoater for your suggestions, and I do understand what you are telling me. In truth, it is a large house with slightly over a hundred Z-Wave devices. The house has a basement, ground and first floors, so 3 floors in total, and the reality is that there is the bottleneck of the staircase through which Z-Wave signals have to move up or down the building. I am resigned to the fact that there will always be some devices dying, but they used to be successfully brought back to life by the automation that pings them right after they die.
So, to clarify, I’m talking about every device on the edge of the network, not any one specific device. My goal is not to stop them from dying; there will always be the occasional device dying for a few seconds, and I understand that is normal. My aim is to stop the network from dying completely for like 2 minutes (or more) every time a device dies. This didn’t use to happen; it started a few weeks ago.
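For reference, the ping automation boils down to pressing the “Ping” button entity that Z-Wave JS exposes per device whenever its node status goes dead. Here it is sketched as a call against the Home Assistant REST API; the URL, token, and entity name are placeholders, not my actual config (in HA itself it is a plain automation triggered on the node-status sensor):

```python
import json
import urllib.request

HA_URL = "http://homeassistant.local:8123"  # placeholder; adjust to your instance
TOKEN = "LONG_LIVED_ACCESS_TOKEN"           # placeholder

def build_ping_call(ping_button_entity):
    """Service call that presses a Z-Wave device's Ping button entity."""
    return ("/api/services/button/press", {"entity_id": ping_button_entity})

def send(path, payload):
    """POST a service call to Home Assistant's REST API."""
    req = urllib.request.Request(
        HA_URL + path,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
    )
    return urllib.request.urlopen(req, timeout=10)

# e.g. send(*build_ping_call("button.periphery_sensor_ping"))  # hypothetical entity
```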
Requirement is the same:
-Check for devices that have high tx/rx retransmits
-Heal those devices in hopes of improving connection
-Add extender or AC-powered device
-depending on distance and causes maybe adding remote zwave network is best
Network doesn’t die because device dies
It dies because the device is alive and having trouble connecting
My home is constructed of metal and brick
Also large size and multi floor
Wi-Fi can’t move room to room in some areas
I also have a separate area away from home that I automate
In the home I use many AC-powered zwave switches, and in some rooms I must place a zwave extender by the door because signals cannot pass through the wall
I extended my network to the area separate from my home. Here I added a pi running zwavejsui only. This connects via ethernet/websocket to my main HA
If you have trouble floor to floor, maybe add a pi running zwavejsui on each floor, as Ethernet/wifi may be better between floors. There is no reason to accept poor function when real solutions exist. Relocating the zwave hub and adding extenders can help also.
Devices should not be dying on a regular basis. I have two 60-node networks; I’ve had exactly 1 device go “dead” in 3 years, and it came back when my automation refreshed it.
You have something wrong. @tmjpugh is giving you good advice. Get a handle on the tx/rx traffic and work to get it down to the bare minimum needed (sensors and frequency of updates). For example, if you have a switch that is reporting voltage, watts, and current and you are only using watts, disable the other 2 in the device’s configuration (or set a really long update time or large thresholds).
I agree that the issue may be a rogue node or excessive reporting. I feel that beyond 60 zwave nodes that it’s time to add zigbee and maybe RF. In my house zwave has grown slowly beyond 60 but I don’t rely to an extreme on one technology. I spread the potential pain out.
Zooz in particular used to be terrible at enabling most reporting functions and spamming the heck out of the zwave network. Many of us with large zwave networks have many 500-series or older devices that have slow transfer speeds.
I have 1 network with around 120 devices. Unfortunately, buying more Z-sticks and RPIs would get costly (like €300 total), so for now I’m stuck with this.
First of all, FYI - I am related to the OP, @Matthew_MBG, and I live in the same household. I get it, @tmjpugh , @PeteRage , @dzmiller, what you are saying about network overload, but it is not the case. Most devices that report Power consumption are configured to only report when consumption changes by 50% or sometimes 75% or more, and these are mostly relay switches or switches that handle water pumps or solenoids. The power consumption rarely changes by that much and so reporting is low.
Of course then we have sensors reporting temperature, illuminance, motion, and so forth - but all of those are needed and have automation linked to them. If they weren’t necessary I would turn off the reporting, but that does not make sense (e.g. temperature controls heating and cooling, illuminance controls whether a light goes on when there’s motion, etc.). Why else have home automation?
Besides, if I look at the Z-wave log, say 2 minutes after a restart, when the network has somewhat settled, I can see lines coming up all the time, but nowhere near what I would call an overload.
The thing is this: the problem started in early December or thereabouts. It has progressively grown worse; we now restart HA twice daily as the Z-wave network grinds to a halt. The following points are worthy of note:
- We have had a few dead devices on the very periphery. Adding an extender will not help; we added 2 already and there was zero change in behaviour. Every room has a minimum of three repeater devices (i.e. powered), but the average would be like 6 repeater devices per room. What is an extender going to change when there are so many devices doing the same thing in every room? It’s basically down to thick concrete walls and even thicker floors and ceilings.
- The problem, as I said, started in December. But why? We did not add devices. We did not re-configure devices. We have not touched anything for months; we’re very happy to not be re-configuring every other day when things are running smoothly. So why now?
- We have had dead devices since forever, and an automation that pings them and almost always manages to bring them back to life within seconds. Every time, I receive a notification on my phone. It did not lead to instability for a couple of years or so, until now.
- Having a dead device bring down the whole network sounds very drastic in my opinion, and it just does not make sense that it should. If a device died and stayed dead permanently, I could sort of understand. But bringing down the whole network? That has never happened to us, with Zipato, then SmartThings, nor even with HA before now.
- I discovered recently that it also happens when any of us tries to go to the Z-Wave JS UI page in the Home Assistant mobile app. I just click Z-wave in the side panel on the right, get the Z-wave Control Panel page, and it appears to take a while to load… And by a while I mean a long while, maybe half a minute or more. The longer it takes, the more I am sure something is going to go wrong, and sure enough, it often brings down the whole Z-wave network. By the time the devices show up, they all appear grey (with “?”) and it seems to be slowly pinging them all and bringing them back to life. It’s like the whole Z-wave add-on restarted. And sometimes it tries 2 or 3 times, continuing to restart, again and again. Finally it starts up, but by then the HA integration says “Failed to unload” and is unable to connect again; the only solution is a full HA restart.
So anyhow, this has been going on for a while. A bit of research led me to this post:
And I seem to be experiencing the same issue this guy is experiencing; it probably started after an HA core update at around that time, but I don’t remember which was the last version that worked well.
Anyhow - any assistance would be appreciated. Many thanks in advance.
You’re giving information but no data.
You NEED to confirm data points before deciding next step. This issue may not be zwave related but to confirm that you must check “device statistics” of every device and confirm there are not high “command dropped” rx or tx.
Why December:
-coincidence
-routes magically changed
-some device causing interference
-low battery somewhere
Doesn’t matter unless you can point to a change during that time. When you find a problem it may make sense. Until then, check stats.
Can you clarify this?
I understand this to mean:
-device is working
-device loses connection to network due to environment condition
-device “node status” changes to dead
-this is happening to several devices
If this is true, yes, this will cause HA server to crash. I have had this exact scenario occur and it drove me nuts. It was one zwave device and it caused my server to slow down and randomly crash. Immediately stopped when I removed from network. I was able to confirm this problem by checking the “device statistics”.
A high RTT isn’t great but devices will work, although delayed at times.
High command dropped RX or TX in the hundreds isn’t great; it will cause problems. In the thousands it will cause havoc. Each command blocks another command, so if commands are missed it causes a flood of traffic on the network.
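As a rule of thumb, those cut-offs look like this (a quick triage sketch; the thresholds are my rough guidance from experience, not official zwave-js values, and the stats input is whatever you read off each device’s statistics page):

```python
def triage_dropped(count):
    """Rough severity for a device's 'commands dropped' RX/TX counters."""
    if count >= 1000:
        return "havoc"    # thousands of drops: network-wide trouble
    if count >= 100:
        return "problem"  # hundreds: will cause problems
    if count > 0:
        return "watch"    # a handful is usually harmless
    return "ok"

def worst_offenders(stats, top=5):
    """stats maps node name -> dropped RX + dropped TX. Worst nodes first."""
    return sorted(stats.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Run it over all your devices and start investigating from the top of the list.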
You keep referring to the devices dying, but please note the following:
- Your understanding is correct, except that it leaves out the part where an automation immediately pings the dead device and in 99% of cases brings it back to life almost immediately (at least that is how it was 2 months ago; now all we know is that the device dies, the ping automation fires, then the Z-Wave network crashes and we have to restart HA, and after the restart there are no dead devices, so the device “dies” only temporarily). Also, please note that everywhere I say that the Z-Wave network crashes, what is really happening is that Z-Wave JS UI restarts, because I see an entry in the log, like so:
[11:21:41] INFO: Service Z-Wave JS UI exited with code 256 (by signal 9)
[11:21:42] INFO: Starting the Z-Wave JS UI...
- The fact that devices on the periphery die and are brought back to life by the ping automation has always been the case, for a couple of years, without any major problems. The issue is restricted to about 5 devices on the physical periphery of the network. I inserted a mobile app notification into the automation that pings them when they die, and so I know it happens about 2 to 3 times a day. Indeed, it has happened since the very beginning of this HA installation, and from what I saw on this community, it is fairly normal in large installations with lots of concrete walls and ceilings.
- You might have missed the part where I said that opening the Z-Wave JS UI interface is causing the whole Z-Wave network to crash, and only an HA reboot recovers from it. That means we have to avoid going to that interface, as it almost certainly brings Z-wave down (mostly in the mobile phone apps; in a browser it works better).
- I checked the RX and TX of all 110 devices. I found the following (HA and Z-Wave JS UI were restarted about 5 to 6 hours ago, so I suppose these stats cover 5 to 6 hours):
- 1 device with 2 Timeout Responses
- 1 device with 3 Dropped RX
- 4 devices with 1 Timeout Response each
- 1 device with 29 Dropped RX
- 1 device with 26 Dropped RX
- 1 device with 9 Timeout Responses
- The Aeotec Z-stick has 2 Timeout ACKs, 58 Dropped RXs in Total Commands, and 1 Dropped RX in Messages.
None of the above screams out at me; the Z-wave network has operated in more dire situations than this without a glitch. Perhaps the two with 29 and 26 Dropped RX might raise an eyebrow, but they’re Fibaro Motion Sensors, and therefore non-repeaters, and hardly flooding the network, honestly. I cannot imagine these devices causing havoc on the network because of a few dropped RXs.
- RTT is usually fine; it might take a second for the furthest of devices, but the vast majority respond in less than a second. It is negligible really for most devices. I don’t know where to check it, as it does not show up in Statistics, but all pings respond nearly instantaneously. This might not be the case in the first couple of minutes after a Home Assistant or Z-Wave JS UI restart, when the network takes time to settle down.
- I see on GitHub that someone has reported something similar: zwave-js-ui crashing after a few days online · Issue #4098 · zwave-js/zwave-js-ui · GitHub. It seems pretty close to what we are experiencing, except ours is more frequent, perhaps because we have many more devices. And our version is indeed 9.29.1.
So there is the whole story. Hope that helps clarify.
And in that moment all was OK?
What about waiting longer?
Can you check stats in the HA device info page for one of the worst devices when things get bad, or is this not possible?
Does turning this automation off stabilize things?
Do you have other automations for zwave device connectivity?
My network runs with virtually no dropped TX or RX; in a year I’ve probably had 1. In two years I had 2 instances of dead devices that recovered after a refresh-entity call. There is something wrong with your network. Start with the basics:
a) physical stick location, usb hub, extender cable, away from metal, stone, cement, tile, wifi routers, etc.
b) assess the actual data volume by enabling logging and using this script on a day’s worth of data, to determine which zwave values are creating the most traffic
c) actively exercise your mains powered devices by refreshing entities periodically.
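For (b), a minimal stand-in for the kind of script I mean: count how many driver-log lines each node generates over the day. It assumes zwave-js’s usual `[Node NNN]` tag in the driver log; adjust the regex if your log format differs:

```python
import re
import sys
from collections import Counter

# zwave-js driver logs tag node traffic like "[Node 012]" (assumption; adjust to taste)
NODE_RE = re.compile(r"\[Node (\d+)\]")

def traffic_by_node(lines):
    """Count driver-log lines per node: a rough proxy for traffic volume."""
    counts = Counter()
    for line in lines:
        m = NODE_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts.most_common()  # busiest nodes first

if __name__ == "__main__" and len(sys.argv) > 1:
    # usage: python traffic.py zwavejs_current.log
    with open(sys.argv[1]) as f:
        for node, n in traffic_by_node(f)[:10]:
            print(f"Node {node:3d}: {n} lines")
```

The nodes at the top of that list are where to look for chatty reporting parameters.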