Some Zwave lights show dead after toggling groups of lights

Hello,

I have Home Assistant on a Raspberry Pi 3, with an Aeotec Zwave USB Stick 7 and an Aeotec Range Extender 7.

I have about 35 Inovelli VZW31-SN Red Series switches. In Home Assistant, when I toggle a group of lights for a room on or off, or change brightness using the entities card (the one with a toggle for the group at the top of the card), some of the zwave lights go into status “dead”.

It’s almost like there’s zwave traffic congestion or something, but I’m only toggling about 5 lights in a group at a time, so that doesn’t seem like it would be a lot of traffic. If I toggle only one light at a time, it doesn’t seem to occur.

Yes, that is a known issue and is due to zwave traffic congestion.

What happens is that HA rapidly issues the commands to zwavejs, and zwavejs sends them out immediately. A new command can then collide with the response to the prior command; messages get retried and eventually time out. It’d be nice if zwave or HA allowed you to specify an inter-message delay.

Most folks solve this by using zwave multicast. Search for that in this forum and you’ll find examples.

Another alternative is to create a script that loops through the group with a small delay between commands. This is the approach I use, and I’ve found that 50ms is about the right delay for my system.
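To make the delay-loop idea concrete, here is a minimal Python sketch of the logic only. It is not Home Assistant script YAML and not my actual script: `send_staggered`, `send_command` and the entity names are hypothetical stand-ins, and the 50ms default is just the value that worked on my system.

```python
import time
from typing import Callable, Iterable


def send_staggered(
    entity_ids: Iterable[str],
    send_command: Callable[[str], None],
    delay_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send one command per entity, pausing between consecutive sends.

    send_command is a hypothetical callback standing in for whatever
    actually turns a light on or off. Returns the number of commands sent.
    """
    sent = 0
    for entity_id in entity_ids:
        if sent:
            # Pause only between commands, not before the first one.
            sleep(delay_s)
        send_command(entity_id)
        sent += 1
    return sent


if __name__ == "__main__":
    # Hypothetical entity names, printed instead of really sent.
    send_staggered(["light.kitchen_1", "light.kitchen_2"], print)
```

The point is that each command gets a quiet window before the next one goes out, so a response is less likely to collide with the following command.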

Ok. Thanks for the explanation. I guess I need to figure out how to change the toggles for a group of lights to use a multicast. I have multicast working for automations already.

Could also be a function of the mesh strength. Certainly a weaker mesh is going to be slower and more prone to this sort of behavior. It’s worth reviewing the best practices for stick location, repeaters, etc.

Could also be a lot of zwave traffic; energy meters are notorious for spamming the zwave network. More traffic = more chances for collisions = higher probability of timeouts. Take a look at the RX stats for each device in zwavejsui or enable the diagnostic sensor for each device in HA. Look for the ones that are significantly higher than the rest.

Ok. Thanks. I’ll check that out. So far I haven’t seen much traffic. My zwave hub and repeater placement is probably not ideal though, so I’ll try putting them in a more central spot. I just figured it was such low frequency it didn’t matter that much. I noticed that when including some of the far devices they had a really hard time being included. They felt left out :)

@PeteRage Thanks for pointing me in the right direction.

I ended up taking a closer look at the debug logs and the network map in zwave js ui. It appears that the zwave mesh sometimes makes really stupid decisions about the best route to the controller/zwave hub.

For example, if I check throughout the day, two of my light switches will suddenly show their routes going through a light on the far side of the house, very far from the zwave hub. The rtt when pinging those devices is very high, 2000ms or greater, and z-wave commands to them seem to time out. For most devices that show a direct route to the controller, the rtt is around 30ms to 50ms, and when I test sending automations/zwave commands to those devices it’s super quick with no problems. I also added a 300ms delay to most of my automations to take load off the controller, although it doesn’t help much with devices that have messed up zwave routes.

If I manually define a priority route and return route directly back to the controller on these problematic lights, the latency is fixed, and the rtt normally drops to about 50ms to 100ms.

Although this raises the question: should manual routes be needed? If I run a rebuild routes on the controller, it doesn’t seem to help and seems to only make the routes worse. I thought the point of the z-wave mesh was that it should make smart decisions on the best route through the mesh.

Here’s my rough understanding.

When the routes are rebuilt, the controller sends the node a handful of routes. The node will then start using the first route and will continue using it until it fails. At that point the current route gets pushed to the bottom of the list and the node starts using the next route.
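As a toy model of that rough understanding (not the actual Z-Wave routing algorithm), the behavior looks like a rotating list. `RouteList` and the route names are made up for illustration.

```python
from collections import deque
from typing import Iterable


class RouteList:
    """Toy model: use the first route until it fails, then rotate."""

    def __init__(self, routes: Iterable[str]) -> None:
        self._routes = deque(routes)
        if not self._routes:
            raise ValueError("at least one route is required")

    @property
    def current(self) -> str:
        """The route the node is using right now."""
        return self._routes[0]

    def report_failure(self) -> str:
        """Push the failed route to the bottom and return the next one."""
        self._routes.rotate(-1)
        return self.current

    def as_list(self) -> list[str]:
        return list(self._routes)


if __name__ == "__main__":
    # Hypothetical routes for one node.
    routes = RouteList(["direct", "via node 12", "via node 40"])
    print(routes.current)           # direct
    print(routes.report_failure())  # via node 12
    print(routes.as_list())         # ['via node 12', 'via node 40', 'direct']
```

In this model a single failure of the good route is enough to leave the node on a worse one until that one fails too, which would match what you are seeing.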

In your case, that fast route must fail at some point, after which the node uses its next route, which as you describe is a pretty poor one. This is a situation where you can take control and create a priority route. Adding some more repeaters may also be a good idea. I speculate that some zwave devices are affected by power line noise, as I see RTTs impacted when HVAC compressors are running. Compressors don’t create much high frequency RF noise, but do create a lot of power line noise…

In-wall devices are in close proximity to drywall, wood and, in my house, metal boxes, hence they usually don’t get selected by the routing. For whatever reason the HomeSeer HSM200 seems to be the device at the center of many routes. It has an ok temperature sensor, a poor luminance sensor, a reasonable occupancy sensor (though I wouldn’t use it as a light trigger), and a light. Whereas I have a dedicated zwave repeater that nothing seems to use…

Thanks again! Yes. I was also suspecting that there is some wireless interference occurring at certain times, although it would be very difficult for me to determine the source.

I’ve added an Aeotec Range Extender 7 to one side of the house that is further from the controller. This area has a bunch of inovelli switches that seem to have a lower rtt. Although I’ve noticed the extender can often be stupidly selected as well, even by a zwave light switch that is right beside the controller.

I’ll try setting priority routes to devices, like whack a mole, when I see them choosing a bad path.

FYI, these are exactly the kind of “glitches” I was also experiencing as I described in another thread. The “ZWave congestion” can occur simply because the HA software running on the Pi isn’t able to react quickly enough. So timeouts occur and devices are judged incorrectly to be dead, even though they and the ZWave network are working fine. Once this happens, various parts of the software try to “fix” the problem, e.g., by changing routes, but may only make things worse.

In my case, the fix was to move the HA installation from a Pi with 1GB to a Pi with 8GB of RAM. That made all of the “ZWave congestion” incidents disappear, without the need to set routes, use multicast, etc.

If you’re still using a Pi/1GB as an HA host, I suggest trying an alternate host with more RAM. Your problems may also just disappear.

Ok. Thanks! I ordered a Raspberry Pi 5 with 4GB of RAM. So hopefully that helps a lot over the 1GB Pi 3b I’m currently running.

I upgraded from the Raspberry Pi 3b to the Pi 5 4GB RAM model. I can now run an automation that issues a zwave “Set device configuration parameter” with about 100ms delay between lights, without any retries or errors on 35 inovelli lights. Definitely seems like the Pi 3b just wasn’t fast enough. Also, adding priority routes to devices that should go through the zwave extender, or devices close to the controller, seemed to make the zwave mesh a lot more reliable. Thanks again @jh95959 and @PeteRage !

I tried z-wave multicast to the lights but that doesn’t seem to work very well. For some devices I get errors like:

[Node 072] received S2 nonce without an active transaction, not sure what to do with it

I also see in the system logs that there seem to be some async errors like this:

File "<string>", line 4, in __init__
  File "/usr/local/lib/python3.12/site-packages/zwave_js_server/model/value.py", line 443, in __post_init__
    self.status = SetValueStatus(self.data["status"])
                                 ~~~~~~~~~^^^^^^^^^^
KeyError: 'status'

So I’ll just not use multicast for now.