70 Zwave device network stuck healing for over 24hrs

I am new to HASS, so I just built up my 70-device Z-Wave network. I had to move the RPI4 closer to 3 locks I was trying to include, and lost about 20 nodes. This happened twice already. This last time I relocated the RPI4 midway between the locks and its original location and ran a network heal. The intent was to do so again after moving it to its final location. The hope was that the nodes could gradually readjust their routes as the RPI4 went from one side of the house to the other.

Anyhow… Zwave JS no longer shows 52/70, but just 70 as if all nodes were fine. I know of nodes that are still dead, and if I try to heal them individually, I get an error stating the network wide heal is still ongoing after over 24hrs.

I get the impression it is stuck. In the logs it looks like the heal entries have stopped. Shall I just reboot HASS?

How do I fix the network? If I just move it back like the first time, I lose the locks, if I don’t, I lose a third of my network.

Try waking all battery powered devices (usually just press a button) and then do the heal again

If you have a lot of devices you may have to wake them as the heal is processing. I’ve found zwavejs won’t finish a heal until all battery powered devices have been woken up.

@mwav3 - Glad to see you on this side of the fence!

I ended up rebooting the host as ZWave JS was reporting all 70 nodes as live but HASS was still telling me healing was ongoing when I tried to heal an individual node as a test. Everything around the house appears to be working and I have already moved the RPI4 closer to its final location. I am a bit afraid to move it one room over in case havoc happens again like the prior times. This time, however, the system did a heal mid way so hopefully the routes should survive the last bit of the relocation.

I don’t have many battery operated Zwave devices as I have always had a better experience with zigbee battery operated sensors. I would have none if Zigbee had equivalent options for all of them… 4-in-1 Zooz sensor for example.

Things have been going great since moving, glad you were able to make the switch. I know it’s tough at first but will be worth it in the end.

I only have a few battery powered devices too, but saw the heal would just stop when it got to them. I believe the suggestion to wake them up came from one of the zwavejs2mqtt developers on GitHub, and it gets things back on track.

I’m using zwavejs2mqtt, which is built on zwavejs, but it’s a little more advanced and has a nice UI panel which includes a network routing graph. That’s been a huge help with diagnosing issues. I think the regular zwavejs integrates with hassio more easily but doesn’t have the UI control panel. Not sure what would be involved with switching it over now or if it would be worth it.

Here’s what the network graph looks like

I’ve read there is a way to make the switch to ZWave2MQTT but I thought the default, and therefore safest option, was the regular ZWave JS. I wonder why they both exist given one can turn off the MQTT portion.

I am really tempted to make the switch… the network map and information on nodes would be of great help! Thanks for sharing!

My understanding is zwavejs came first, and zwavejs2mqtt runs on top of zwavejs to integrate it with MQTT and add the control panel, which makes it usable well beyond Home Assistant. Think of zwavejs like DOS and zwavejs2mqtt like Windows: one is built on top of the other. Since zwavejs is the “base”, it was probably easier for Home Assistant to integrate it as the “official” zwave integration, whereas zwavejs2mqtt is a community addon.

To make it more confusing, zwavejs2mqtt can be run in websocket mode with MQTT turned off (how I run it), in which case it links to Home Assistant through the zwavejs integration. I run Home Assistant Container, which has no addons, so I had to use zwavejs2mqtt and do everything the hard way.

The advanced zwave configuration documentation does a decent job explaining the setup.

Just as a warning, the network map is entirely useless. While it looks nice, the information it shows is simply the node neighbor graph. That is completely unrelated to the actual routing and mesh topology used internally by zwave. The real route your data takes over the mesh can often be totally different from what you see on the graph. It can change on its own depending on signal noise levels, it can route over nodes that are not listed as neighbors in the graph, and it can even be asymmetrical (send and receive can go over different routes).

That said, switching to zwavejs is totally worth it. I did it myself recently and now I wonder why I stuck with OpenZWave for so long. Oh and you should definitely use the UI that comes with zwavejs2mqtt.

The integration I am using is the default ZWave JS integration so I need to read up on migrating to ZWave 2 MQTT without a major re-do of the setup as I’ve already lost a few days on that due to issues.

While I haven’t used it in a long time, I do have this tool:

Maybe it is time to test it out again to see if it can help resolve my issues. Right now everything is working fine but I expect havoc to happen again :slight_smile:

ZWaveJS2MQTT. It’s all a bit confusing, I agree. The history is basically that the guy who made ZWave2MQTT, which was and still is unrelated to Home Assistant, previously used OpenZWave as a backend driver. Then OpenZWave went MIA and he switched to zwavejs. Thus ZWave2MQTT became ZWaveJS2MQTT.

This was before HA decided to switch to zwavejs as well. That switch was done in a rush, so some suboptimal decisions were taken. There are all kinds of different addons, etc., and a HA native UI is being developed instead of reusing the one already available in ZWaveJS2MQTT. But in the end, they all use the same zwavejs backend driver.

In fact, you can entirely remove all zwave related integrations from HA and only use ZWaveJS2MQTT on its own, which will then communicate with HA over MQTT. That’s what I ended up doing, as I’m trying to reduce dependencies on integrations in HA to an absolute minimum. So there are lots of different options.

Oh yeah, that will definitely show you the real routes and much more. Needs a bit of knowledge about the internal operations of the zwave protocol to use effectively though.

The network map shows the “preferred route”, which might not be the actual route a message takes in a mesh network. So, I wouldn’t say “entirely useless”, but take its information with a grain of salt. The information in the graph could be useful, especially the number of hops and the weaknesses in the network. Another thing I’ve found useful is the logs. There are two sets of them in zwavejs2mqtt, general and zwave, and they can tell you about problems with the devices.

Showing all the potential network paths would create a messy, unusable graph. Here’s a discussion where the developers of zwavejs2mqtt talk through how to represent the network and what they decided. They wanted to avoid a graph that looks like a “ball of wool”.

I’m sure that zwave toolbox device will tell you way more advanced diagnostic info than zwavejs2mqtt can though.

I tried running it this way and the websockets seemed quicker and more reliable than mqtt. You have to rely on an mqtt broker this way vs the websockets, so you are adding an additional “middle man” for communication. I’m sure this makes sense depending on your use case. Are you using mqtt for things outside Home Assistant? If so it definitely would make sense to do it this way. Otherwise I’m not sure what the concerns are around the dependency on the zwave integration.

But does it? I just dug into the sources for zwavejs2mqtt and zwavejs to find out.

In order to build the graph, zwavejs2mqtt only uses the node neighbor lists (see ZwaveGraph.vue). Hops and links between nodes are determined by looking at the first neighbor if multiple are present. Which is almost certainly 100% wrong, unless you have a node with only a single neighbor (and even then, it can be wrong, see below). When I do a neighbor update on my zwavejs, the returned lists are sequential node IDs. There is no information about LWR/NLWR (what you would call preferred route), nor any information about signal levels.
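To make the problem concrete, here is a rough Python sketch of the "follow the first neighbor" approach described above. It is a simplified reading of that behaviour, not the actual ZwaveGraph.vue code; the function name `first_neighbor_hops`, the data layout, and the example neighbor lists are all made up for illustration.

```python
from typing import Dict, List, Optional


def first_neighbor_hops(
    neighbors: Dict[int, List[int]], controller: int = 1
) -> Dict[int, Optional[int]]:
    """Count hops to the controller by always following the first listed neighbor.

    Hypothetical helper: `neighbors` maps a node ID to its neighbor list,
    sorted by node ID (as the lists come back after a neighbor update).
    """
    hops: Dict[int, Optional[int]] = {}
    for node in neighbors:
        current, count, seen = node, 0, {node}
        result: Optional[int] = None
        while True:
            if current == controller:
                result = count
                break
            candidates = neighbors.get(current, [])
            # Give up on dead ends and loops instead of spinning forever.
            if not candidates or candidates[0] in seen:
                break
            current = candidates[0]
            seen.add(current)
            count += 1
        hops[node] = result
    return hops


# Made-up example: node 6 can reach the controller in 2 hops via node 5,
# but its lowest-numbered neighbor is node 2, so the graph claims 3 hops.
example = {2: [3, 6], 3: [1, 2], 5: [1, 6], 6: [2, 5]}
print(first_neighbor_hops(example))
```

The result depends only on how node IDs happen to be numbered, which is why the hop counts in such a graph say nothing about the route a frame actually takes.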

To create the neighbor lists, zwavejs calls the ZW_GetRoutingInfo method on the zwave API (see GetRoutingInfoResponse). The zwave API reference does not give any guarantee that the order of the returned neighbors is in any way related to the APR/LWR/NLWR. It only states that (section 4.4.15):

ZW_GetRoutingInfo is a function that can be used to read out neighbor information from the protocol. This information can be used to ensure that all nodes have a sufficient number of neighbors and to ensure that the network is in fact one network.

In fact, the standard clearly defines when neighbor lists are used for routing, and that’s almost never (section 3.10.2):

The routing attempts done by a static controller to reach the destination node are as follows:

* If APR, LWR and NLWR all are non-existing and TRANSMIT_OPTION_ACK set. Try direct when neighbors with retries.
* If APR exist and TRANSMIT_OPTION_ACK set. Try the APR. If APR fails then try LWR if it exist and if it also fails then remove the LWR and try direct if neighbor.
* If APR do not exist, LWR exist and TRANSMIT_OPTION_ACK set. Try the LWR. In case the LWR fails, ‘exile’ it to become NLWR and try old NLWR if it exist. if the NLWR also fails, remove it and try direct if neighbor.
* If APR do not exist, LWR do not exist, NLWR exist and TRANSMIT_OPTION_ACK set. Try the NLWR. In case the NLWR fails remove it and try direct if neighbor.
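The four cases quoted above can be condensed into a small Python sketch that returns the order in which routes would be tried. This is only my reading of the quoted text, with TRANSMIT_OPTION_ACK assumed to be set; the function name `routing_attempts` and the string labels are invented for illustration, and the side effects (removing or "exiling" a failed route) are left out.

```python
from typing import List


def routing_attempts(apr: bool, lwr: bool, nlwr: bool) -> List[str]:
    """Order of routing attempts for a static controller, per the quoted cases.

    Each flag says whether that route exists. "direct" is only attempted
    when the destination is a neighbor, and it always comes last.
    """
    if apr:
        # APR first, then LWR if it exists, then direct.
        attempts = ["APR"]
        if lwr:
            attempts.append("LWR")
    elif lwr:
        # LWR first, then the old NLWR if it exists, then direct.
        attempts = ["LWR"]
        if nlwr:
            attempts.append("NLWR")
    elif nlwr:
        attempts = ["NLWR"]
    else:
        # No stored routes at all: go straight to direct (with retries).
        attempts = []
    attempts.append("direct (if neighbor)")
    return attempts


print(routing_attempts(apr=False, lwr=True, nlwr=True))
# ['LWR', 'NLWR', 'direct (if neighbor)']
```

In every branch the neighbor-based direct attempt sits at the end of the list.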

So basically neighbors are only used as an ultimate fallback if everything else fails. And even then it will route through the neighbor with the strongest signal, and that information is not available or used in zwavejs or the network graph either.

So in conclusion, what the zwavejs network graph displays has nothing to do with the routing. It’s completely random and should not be relied upon.

Yeah that discussion made me smile. How to display random data in a pretty way :grin:

About the MQTT, there are pros and cons. I’ll reply to this later, I’m on the go right now.

Yeah I agree, there’s definitely way more to it than just pulling the first neighbor and making a graph out of it. Zwave routing is a complex topic and it looks like the developers chose to just oversimplify it to make a “pretty” graph.

I guess the question is, can you even make a reliable routing graph with the information that would be available to a zwave management program like zwavejs2mqtt? The key to determining routing seems to be the “routing table” stored on the controller, based on my understanding and this great article: Understanding Z-Wave Networks, Nodes & Devices – Vesternet. It seems some information on the zwave protocol is proprietary, and seeing the routing tables stored on the controller relies on using Silicon Labs’ own PC Controller software. I’ve used that program to update firmware and would say it’s far from “user friendly”.

Thanks for the info and definitely interested to hear your thoughts on MQTT vs Websockets.

This thread is turning into a Wikipedia page on ZWave! Interesting read, thanks guys! Great stuff to learn about!

This is not a routing graph.

The network graph is more than useless. It has completely wrong information on it, and caused me to do a lot of damage to a perfectly well functioning zwave network when it said devices were routing through nodes they actually weren’t. Even the neighbor info is generally wrong.

I filed a bug on the zwavejs2mqtt git repo suggesting the network graph be completely removed. It does much more harm than good right now.

Yes, from the additional info and context from @HeyImAlex above I realize that now

Do you have a link? I couldn’t find it on there. I’d be interested in how the developers respond.

Not sure why the developers don’t understand how bad it is.

If it were a simplified representation then that would be fine. But it’s completely wrong and misleading. And that’s a problem.

Not really, but you can make something more useful. Getting the precise routing requires a sniffer like @aruffell linked to. But even that just gives you a momentary view of the network state and is more useful for deep protocol debugging than for general usage. As far as I see it, it doesn’t make any sense to try representing a mesh network as a fixed hierarchical graph, because that’s not what it is. Mesh networks are mutable by definition and should be represented differently.

Do we really care about exactly which route a message took? Not really, but we care about things like node congestion, weak paths or isolated nodes. These problems can be measured and represented as statistics over time. Basically, collect network stats and aggregated routes over a period of, say, 24 hours. The representation should then display node connectivity as a type of traffic network. Heavily used routes could be represented as thick arrows, rarely used fallback routes as thin ones. Average signal strength between nodes should be shown. That way you can instantly see where the ‘data highways’ are and which nodes get congested because they end up being the sole router for half of your house. You can also see nodes that are struggling to relay their messages due to a weak mesh. A top-down hierarchical graph with a single connection per node is just wrong, and making it pretty won’t help.
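As a rough illustration of that idea, here is a minimal Python sketch that aggregates observed routes into per-link usage counts and per-node relay load. It assumes you already have a list of routes captured over the collection window (for example from a sniffer); the function names and the sample routes are hypothetical, and nothing here is an existing zwavejs feature.

```python
from collections import Counter
from typing import Dict, List, Tuple

Route = List[int]  # node IDs from source to destination, in order


def link_usage(routes: List[Route]) -> Dict[Tuple[int, int], int]:
    """Count how often each directed link (a -> b) was used.

    Links are kept directed because send and receive may take different paths.
    """
    usage: Counter = Counter()
    for route in routes:
        for a, b in zip(route, route[1:]):
            usage[(a, b)] += 1
    return dict(usage)


def relay_load(routes: List[Route]) -> Dict[int, int]:
    """Count how often each node acted as a repeater (not as source or destination)."""
    load: Counter = Counter()
    for route in routes:
        for node in route[1:-1]:
            load[node] += 1
    return dict(load)


# Made-up routes observed over the collection window.
observed = [[1, 3, 6], [1, 3, 7], [1, 5, 6], [6, 5, 1]]
print(link_usage(observed))   # thick arrows for high counts, thin for low
print(relay_load(observed))   # nodes with high counts are congestion candidates
```

The link counts would drive arrow thickness in the drawing, and the relay counts point at nodes that carry traffic for a large part of the house.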

Now it’s true that a lot of that information cannot be gathered by zwavejs on the primary controller. I don’t know if there’s even a way to query the nodes for their respective LWR and NLWR. The controller might not even be aware of certain node routes due to exploratory frames. A zwave sniffer would be needed for that. But what zwavejs could do is make use of signal RSSI. As far as I know, this info is available over the zwave API. The specs are also more or less explicit about how they establish routes based on signal dBm. While there probably are some proprietary details left out, it could be used to display more useful information. Like identifying nodes with weak connections or giving some best effort guesses for the most likely routes taken. Anything would be better than the current graph.
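A first step in that direction could be as simple as flagging nodes whose average RSSI falls below some cutoff. The sketch below assumes the averages have already been collected somehow; the function name, the input format and the -85 dBm default are illustrative assumptions of mine, not values from the zwave specification or from zwavejs.

```python
from typing import Dict, List


def weak_nodes(avg_rssi_dbm: Dict[int, float], threshold_dbm: float = -85.0) -> List[int]:
    """Return the IDs of nodes whose average RSSI is below the threshold.

    The threshold is an arbitrary example value; tune it for your own network.
    """
    return sorted(node for node, rssi in avg_rssi_dbm.items() if rssi < threshold_dbm)


# Made-up averages in dBm (closer to zero means a stronger signal).
print(weak_nodes({2: -60.0, 3: -91.5, 4: -88.0}))
```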

On another note, digging around the zwavejs source I found out why network healing doesn’t always work as expected and sometimes leaves the network in a worse state than before. zwavejs does the healing using the same flawed neighbor metric as the graph. It establishes a hierarchical connectivity graph based on neighbors (which is almost guaranteed to be wrong) and then starts the healing from the bottom up. The problem with this is that since the graph is wrong, the healing order will not be optimal (read: it will be random) and some nodes will be healed in the wrong order, missing potential repeater nodes in the process.

I mean it’s better than nothing for sure. But a better connectivity metric using signal levels should definitely be high up on their backlog priority.
