70 Zwave device network stuck healing for over 24hrs

HeyImAlex · August 24, 2021, 9:24am

But does it ? I just dug into the sources for zwavejs2mqtt and zwavejs to find out.

In order to build the graph, zwavejs2mqtt only uses the node neighbor lists (see ZwaveGraph.vue). Hops and links between nodes are determined by looking at the first neighbor if multiple are present. Which is almost certainly 100% wrong, unless you have a node with only a single neighbor (and even then, it can be wrong, see below). When I do a neighbor update on my zwavejs, the returned lists are sequential node IDs. There is no information about LWR/NLWR (what you would call preferred route), nor any information about signal levels.
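
To make the problem concrete, here is a minimal Python sketch of the "first neighbor wins" heuristic described above. It is not the ZwaveGraph.vue code: the function name and the sample neighbor lists are made up for illustration, and the lists are assumed to come back sorted by node ID, as observed above.

```python
def first_neighbor_edges(neighbors: dict[int, list[int]]) -> dict[int, int]:
    """Link every node to the first entry of its neighbor list (the flawed heuristic)."""
    return {node: lst[0] for node, lst in neighbors.items() if lst}


# Hypothetical neighbor lists, sorted by node ID; node 1 is the controller.
neighbors = {
    2: [1, 3, 4],
    3: [2, 4],
    4: [2, 3, 5],
    5: [4],
}

edges = first_neighbor_edges(neighbors)
print(edges)  # {2: 1, 3: 2, 4: 2, 5: 4}
```

Node 4 ends up drawn as routing through node 2 only because 2 is the lowest ID in its list, not because anything was measured about that link.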

To create the neighbor lists, zwavejs calls the ZW_GetRoutingInfo method on the zwave API (see GetRoutingInfoResponse). The zwave API reference does not state any guarantee that the order of returned neighbors is in any way related to the APR/LWR/NLWR. It only states that (section 4.4.15):

ZW_GetRoutingInfo is a function that can be used to read out neighbor information from the protocol. This information can be used to ensure that all nodes have a sufficient number of neighbors and to ensure that the network is in fact one network.

In fact, the standard clearly defines when neighbor lists are used for routing, and that’s almost never (section 3.10.2):

The routing attempts done by a static controller to reach the destination node are as follows:

* If APR, LWR and NLWR all are non-existing and TRANSMIT_OPTION_ACK set. Try direct when neighbors with retries.
* If APR exist and TRANSMIT_OPTION_ACK set. Try the APR. If APR fails then try LWR if it exist and if it also fails then remove the LWR and try direct if neighbor.
* If APR do not exist, LWR exist and TRANSMIT_OPTION_ACK set. Try the LWR. In case the LWR fails, ‘exile’ it to become NLWR and try old NLWR if it exist. if the NLWR also fails, remove it and try direct if neighbor.
* If APR do not exist, LWR do not exist, NLWR exist and TRANSMIT_OPTION_ACK set. Try the NLWR. In case the NLWR fails remove it and try direct if neighbor.

So basically neighbors are only used as a last-resort fallback if everything else fails. And even then it will route through the neighbor with the strongest signal. That signal information is not available in, or used by, zwavejs or the network graph either.
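
The four quoted rules can be sketched as a small Python function. This is only a reading of the quoted text, not controller firmware: the function name and the boolean parameters are hypothetical, TRANSMIT_OPTION_ACK is assumed to be set, and every attempt is assumed to fail so that the whole fallback chain is listed.

```python
def routing_attempts(has_apr: bool, has_lwr: bool, has_nlwr: bool,
                     is_neighbor: bool) -> list[str]:
    """Return the order in which routes would be tried if each attempt fails."""
    attempts: list[str] = []
    if has_apr:
        # APR first, then the LWR if one exists.
        attempts.append("APR")
        if has_lwr:
            attempts.append("LWR")
    elif has_lwr:
        # No APR: LWR first, then the old NLWR if one exists.
        attempts.append("LWR")
        if has_nlwr:
            attempts.append("NLWR")
    elif has_nlwr:
        attempts.append("NLWR")
    # A direct transmission is only attempted when the destination is a neighbor.
    if is_neighbor:
        attempts.append("direct")
    return attempts


print(routing_attempts(False, True, True, True))    # ['LWR', 'NLWR', 'direct']
print(routing_attempts(False, False, False, True))  # ['direct']
```

In every branch the neighbor information only matters for the final "direct" step, which is the point made above.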

So in conclusion, what the zwavejs network graph displays has nothing to do with the routing. It’s completely random and should not be relied upon.

Yeah that discussion made me smile. How to display random data in a pretty way

About the MQTT, there are pros and cons. I’ll reply to this later, I’m on the go right now.

mwav3 · August 24, 2021, 1:26pm

Yeah I agree, there’s definitely way more to it than just pulling the first neighbor and making a graph out of it. Zwave routing is a complex topic and it looks like the developers chose to just oversimplify it to make a “pretty” graph.

I guess the question is, can you even make a reliable routing graph with the information that would be available to a zwave management program like zwavejs2mqtt? The key to determining routing seems to be the “routing table” stored on the controller, based on my understanding and this great article - Understanding Z-Wave Networks, Nodes & Devices – Vesternet. It seems some information on the zwave protocol is proprietary, and seeing the routing tables stored on the controller relies on using Silicon Labs’ own PC controller software. I’ve used that program to update firmware and would say it’s far from “user friendly”.

Thanks for the info and definitely interested to hear your thoughts on MQTT vs Websockets.

aruffell · August 24, 2021, 2:38pm

This thread is turning into a Wikipedia page on ZWave! Interesting read, thanks guys! Great stuff to learn about!

firstof9 · August 24, 2021, 3:24pm

This is not a routing graph.

fresnoboy · August 24, 2021, 3:46pm

The network graph is more than useless. It has completely wrong information on it, and caused me to do a lot of damage to a perfectly well functioning zwave network when it said devices were routing through nodes they actually weren’t. Even the neighbor info is generally wrong.

I filed a bug on the zwavejs2mqtt git repo suggesting the network graph be completely removed. It does much more harm than good right now.

mwav3 · August 24, 2021, 5:00pm

Yes, from the additional info and context from @HeyImAlex above I realize that now

mwav3 · August 24, 2021, 5:05pm

Do you have a link? I couldn’t find it on there. I’d be interested in how the developers respond.

fresnoboy · August 24, 2021, 5:20pm

Not sure why the developers don’t understand how bad it is.

HeyImAlex · August 24, 2021, 7:36pm

If it were a simplified representation then that would be fine. But it’s completely wrong and misleading. And that’s a problem.

Not really, but you can make something more useful. Getting the precise routing requires a sniffer like @aruffell linked to. But even that just gives you a momentary view of the network state and is more useful for deep protocol debugging than for general usage. As far as I see it, it doesn’t make any sense to try representing a mesh network as a fixed hierarchical graph, because that’s not what it is. Mesh networks are mutable by definition and should be represented differently.

Do we really care about what route exactly a message took ? Not really, but we care about things like node congestion, weak paths or isolated nodes. These problems can be measured and represented as statistics over time. Basically collect network stats and aggregated routes over a period of say 24 hours. The representation should then display node connectivity as a type of traffic network. Heavily used routes could be represented as thick arrows, rarely used fallback routes as thin ones. Average signal strength between nodes should be shown. That way you can instantly see where the ‘data highways’ are, which nodes get congested because they end up being the sole router for half of your house. You can also see nodes that are struggling to relay their message due to a weak mesh. A top down hierarchical single connecting graph is just wrong and making it pretty won’t help.
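
As a rough sketch of what such aggregation could look like in Python: each observation is assumed to be a route (list of node IDs from source to destination) plus one RSSI value per hop. Where those observations would come from (a sniffer, driver statistics) is left open, and all names and sample values here are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LinkStats:
    frames: int = 0
    rssi_sum: float = 0.0

    @property
    def avg_rssi(self) -> float:
        return self.rssi_sum / self.frames


def aggregate(observations):
    """Count frames and accumulate RSSI for every directed link seen in the observed routes."""
    stats = defaultdict(LinkStats)
    for route, hop_rssi in observations:
        # Pair each hop (a -> b) with the RSSI measured on that hop.
        for (a, b), rssi in zip(zip(route, route[1:]), hop_rssi):
            link = stats[(a, b)]
            link.frames += 1
            link.rssi_sum += rssi
    return dict(stats)


# Hypothetical observations collected over some period; node 1 is the controller.
observations = [
    ([5, 4, 2, 1], [-70, -62, -55]),
    ([3, 2, 1], [-60, -57]),
    ([5, 4, 2, 1], [-74, -64, -55]),
    ([4, 3, 2, 1], [-80, -60, -56]),
]

stats = aggregate(observations)
for (a, b), s in sorted(stats.items(), key=lambda kv: -kv[1].frames):
    print(f"{a}->{b}: {s.frames} frames, avg {s.avg_rssi:.1f} dBm")
```

The frame count would drive arrow thickness and the average RSSI the link label. In this made-up data the 2->1 link carries every frame, which is exactly the kind of "sole router" hot spot described above.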

Now it’s true that a lot of that information cannot be gathered by zwavejs on the primary controller. I don’t know if there’s even a way to query the nodes for their respective LWR and NLWR. The controller might not even be aware of certain node routes due to exploratory frames. A zwave sniffer would be needed for that. But what zwavejs could do is make use of signal RSSI. As far as I know, this info is available over the zwave API. The specs are also more or less explicit about how they establish routes based on signal dBm. While there probably are some proprietary details left out, it could be used to display more useful information. Like identifying nodes with weak connections or giving some best effort guesses for the most likely routes taken. Anything would be better than the current graph.
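
A minimal Python sketch of the "identify nodes with weak connections" idea, assuming per-node RSSI samples have already been collected somehow. The threshold is an arbitrary illustration value, not something taken from the Z-Wave specs, and the function name and samples are made up.

```python
from statistics import mean

WEAK_DBM = -85.0  # assumed threshold for illustration only


def weak_nodes(rssi_samples: dict[int, list[float]],
               threshold: float = WEAK_DBM) -> list[int]:
    """Return node IDs whose average RSSI is below the threshold, weakest first."""
    # Nodes without any samples are skipped rather than guessed at.
    averages = {node: mean(vals) for node, vals in rssi_samples.items() if vals}
    weak = [node for node, avg in averages.items() if avg < threshold]
    return sorted(weak, key=lambda node: averages[node])


# Hypothetical RSSI samples in dBm, keyed by node ID.
samples = {2: [-55, -58, -60], 7: [-88, -91, -86], 9: [-95, -97], 12: []}
print(weak_nodes(samples))  # [9, 7]
```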

HeyImAlex · August 24, 2021, 7:42pm

On another note, digging around the zwavejs source I found out why network healing doesn’t always work as expected and sometimes leaves the network in a worse state than before. zwavejs does the healing by using the same flawed neighbor metric as the graph. It establishes a hierarchical connectivity graph based on neighbors (which is almost guaranteed to be wrong) and then starts the healing from the bottom up. The problem with this is that since the graph is wrong, the healing order will not be optimal (read: it will be random) and some nodes will be healed in the wrong order, missing potential repeater nodes in the process.
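
To show how sensitive such an ordering is to the neighbor data, here is a small Python sketch. It is not the zwavejs implementation: it assumes the hierarchy is a breadth-first hop depth from the controller and that nodes are healed in depth order (closest first here, which is a guess; reversing the direction does not change the argument). The names and neighbor lists are hypothetical.

```python
from collections import deque


def hop_depths(neighbors: dict[int, list[int]], controller: int = 1) -> dict[int, int]:
    """Breadth-first hop count from the controller, using only neighbor lists."""
    depths = {controller: 0}
    queue = deque([controller])
    while queue:
        node = queue.popleft()
        for other in neighbors.get(node, []):
            if other not in depths:
                depths[other] = depths[node] + 1
                queue.append(other)
    return depths


def heal_order(neighbors: dict[int, list[int]], controller: int = 1) -> list[int]:
    """Order nodes by estimated depth, ties broken by node ID."""
    depths = hop_depths(neighbors, controller)
    return sorted((n for n in depths if n != controller), key=lambda n: (depths[n], n))


# Hypothetical neighbor lists; node 5 is really three hops away.
reported = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4]}
print(heal_order(reported))  # [2, 3, 4, 5]

# One wrong entry (controller claims to hear node 5) reshuffles the order.
stale = {**reported, 1: [2, 3, 5]}
print(heal_order(stale))  # [2, 3, 5, 4]
```

A single incorrect neighbor entry moves node 5 ahead of the node it actually depends on as a repeater.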

I mean it’s better than nothing for sure. But a better connectivity metric using signal levels should definitely be high up on their backlog priority.

mwav3 · August 24, 2021, 7:59pm

This is more concerning than just a bad graph, and I’ve noticed a heal usually “temporarily” makes things worse for me. I chalked it up to my mix of older and plus devices, but things always stabilize within a few days so I assume the network figures itself out. Healing individual troublesome nodes, and the ones around them that show issues in the logs, has worked out much better for me than healing the whole network, and I haven’t done a network wide heal in a while.

Have you commented on their github at all? Maybe even on the discussion linked by @fresnoboy ? The developer didn’t seem to entertain the idea of removing or changing the graph at all. Maybe hearing from more users could help them understand the concerns around the graph/heal? I’d be happy to add comments but I feel you have a much more solid grasp of the protocol and issues than I do to get the ball rolling again.

HeyImAlex · August 24, 2021, 8:23pm

No and honestly, I don’t plan to. The problem with issues like this is that they will always end in a battle of egos where technical argumentation is mostly ignored. I mean, his response ‘I see no reason why we should remove it, not all feedbacks about it are so bad’ pretty much shows this is a dead end. I’ve had my share of trying to argue with developers over illogical issues like that (cough multifloor support in Valetudo cough) and I don’t want to get involved in that again. I’ve had a few PRs that fixed bugs like that rejected before purely because reasons (read - I hurt someone’s feelings). These days I just pull the source, modify what I think is broken for me and that’s it.

HeyImAlex · August 24, 2021, 9:20pm

Everything in my home runs on MQTT, it’s the common backend for all of my smart devices. All except zwave, until I recently switched to zwavejs2mqtt. Reliability of well configured MQTT is extremely good. Performance wise, websockets are probably faster. But when it comes down to it, the biggest bottleneck in both cases is HA. zwavejs is super fast, Mosquitto is blazingly fast and HA is, well, not that fast. But I never really noticed any lag, as long as you don’t overload the broker. HA using discoverable MQTT topics tends to do that from time to time, because of the sheer number of topics it subscribes to (and entities it creates). I use manually defined MQTT entities, so I don’t have this issue. And of course it depends on your CPU too.

The reason I do it this way rather than using the official zwave integration mostly comes down to my own philosophy. I don’t really agree with HA’s philosophy of being an all in one blackbox abstracting heterogeneous smart home devices. Because you end up becoming dependent on the whims of some HA integration developer, breaking changes and whatnot. I prefer having a completely open MQTT based backend and only use HA as a web frontend basically. I even started transitioning my automations out of HA. I like independent and easily replaceable subsystems. If needed, I could even replace HA with something else rather painlessly. Not that I want to (yet), HA is great. But I like having options.

But yeah, for most users the websockets implementation is probably better.

fresnoboy · August 24, 2021, 9:54pm

I use the websockets implementation and run zwavejs2mqtt (and deconz for that matter) on a pi and not on my HA machine. This has a lot of advantages:

* I can run HA in a guest VM and be able to migrate it to other hosts in my cluster and not take any downtime, because there is no USB passthrough etc. to deal with.
* I can put the Pi in the center of the home or wherever the best location is for those devices, and I can run multiples of them if I need to, and not have to worry about where HA runs.
* I can change the versions of zwavejs2mqtt and deconz independently from the native integrations on HA. HA’s testing is not always robust (which is why we have a cascade of point releases after each month’s new version is released).

Now, on the Zwave point, I came from Homeseer which had pretty decent zwave support, but my lot and home are fairly big and I had issues with some of the devices from a reliability POV. I made the move to zigbee for everything except locks and some dome valves that control water and gas supplies. Zigbee has been phenomenally reliable compared with zwave, but the smaller network of mostly new zwave devices was reliable.

Reliable, that is, until after I imported things into HA and then started trying to fix the things the network graph was showing me. Only after I realized the data is totally useless did I stop and manage to get things stabilized.

I have to say, I can’t recommend HA to anyone who has a major zwave investment - the support is really pretty poor, and the network graph is just totally wrong. It’s worse than not having anything at all. But like you, I can’t understand why a developer would answer like that and not address how broken it is.

I wonder if Paulus or any of the principals understand how bad it is.

HeyImAlex · August 24, 2021, 10:06pm

Your three points are very valid, there can indeed be good reasons to keep zwavejs on an entirely separate system. Note that this also works with MQTT (obviously). About your third point, you can also keep both zwavejs and HA separate on the same system (and update / maintain them independently of each other) if you install HA core (venv) or HA container (Docker).

I agree that zwave support is pretty poor on HA compared to some commercial systems. But commercial systems will always have an advantage in that respect. They have paid developers, can license commercial zwave stacks and have a financial interest in not adding misleading features that people could use to ruin their setup and then proceed to ask for refunds or litigate. HA is all volunteers (with all the good and bad things this entails) and zwave is far from being an easy protocol to write a driver for. AlCalzone did a tremendous job with zwavejs, especially considering it’s open source and he does it in his free time. HA zwave support has also improved a lot over the years, especially since they migrated to zwavejs. There are still some (very) rough corners, but I’m sure they will eventually be ironed out.

petro · August 24, 2021, 10:18pm

It’s a custom addon, not official. Zwave JS is the official addon and it does not have the graph. Keep in mind that community addons are not official addons.

Zwavejs2mqtt is managed by people outside HA.

fresnoboy · August 24, 2021, 11:18pm

Thanks for pointing this out. I had no idea. I thought the Zwave part was the same and that the addon part was just that it runs separately.

This explains a lot actually. Not sure how to fix it though.

And the heal nodes function, is that in the core ZwaveJS part?

petro · August 24, 2021, 11:28pm

It’s all ZwaveJS stuff. The only thing that home assistant manages is the integration & its UI and the official addon. The official addon is a small python wrapper for the zwave js server.

mwav3 · August 25, 2021, 12:14am

This definitely makes a lot of sense. I use zigbee2mqtt which doesn’t have websockets (yet at least) and because it’s mqtt I have node red flows that interact with it that are completely outside home assistant. That wouldn’t be possible with a home assistant websocket based Integration. Mqtt definitely leaves more flexibility to work with other things, even if websockets might be quicker for Home Assistant.

I only have 22 zwave devices so switching them down the road to mqtt wouldn’t be a big deal if it became a concern in the future, but if you have many devices, it would be more of a pain to reconfigure everything.

Getting back to the heal and network graph issue, I understand the frustration of going up against developers’ egos while trying to get things working. I think this is a big enough issue, though, that I’ll try to do a bit more research and file an issue on the zwavejs github, ideally with some suggestion to fix it. A network heal shouldn’t make things worse, and you clearly identified a problem in the code. I experienced bad heals first hand and it sounds like @aruffell experienced it too, which started this whole post.

I will say though, despite this heal/graph issue, OpenZwave was terrible, and the improved zwavejs2mqtt coming out was what finally pushed me to switch everything to Home Assistant. I’m still very happy with it. They are actively developing it and I see it improve with each release. It also remains way better than a cloud based Smartthings hub, although I’m sure better commercial zwave options exist. Hopefully they can get this sorted out.

HeyImAlex · August 25, 2021, 1:25am

I agree that this is a relatively serious problem, but I’m afraid there’s no easy fix for it. At least not until zwavejs implements a different internal structure representing the mesh topology in a stochastic manner based on signal strength. Instead of using fixed neighbors, the mesh would be represented by probabilities that a node may route through another. I remember reading somewhere that something like that was on their todo list, but it doesn’t seem very high priority at this time.
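
One possible shape for such a stochastic representation, sketched in Python. This is purely an illustration of the idea, not anything zwavejs does or plans: it turns the RSSI of each candidate next hop into a probability with a softmax, and both the softmax choice and the `scale_db` parameter are assumptions.

```python
import math


def route_probabilities(rssi_dbm: dict[int, float],
                        scale_db: float = 6.0) -> dict[int, float]:
    """Softmax over RSSI: stronger candidate hops get a higher routing probability."""
    if not rssi_dbm:
        return {}
    # Subtract the strongest value first to keep the exponentials well behaved.
    strongest = max(rssi_dbm.values())
    weights = {node: math.exp((rssi - strongest) / scale_db)
               for node, rssi in rssi_dbm.items()}
    total = sum(weights.values())
    return {node: weight / total for node, weight in weights.items()}


# Hypothetical RSSI (dBm) from one node to each candidate next hop.
candidates = {2: -58.0, 4: -64.0, 7: -82.0}
probs = route_probabilities(candidates)
for node, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"via node {node}: {p:.2f}")
```

A smaller `scale_db` makes the strongest link dominate; a larger one spreads the probability more evenly, which may be closer to how a mesh behaves when several links are comparable.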

Actually network healing is a pretty complex and non-trivial task. Earlier this year I designed and implemented my own diy long range mesh network protocol from scratch, operating on the 433MHz band, just for kicks. It was a very interesting learning experience and really showed me the subtleties and dark corners of mesh networking (like messages getting stuck in an infinite repeater loop for days before I noticed). Healing is non trivial because the healing of one node may propagate its effects over the entire network and affect the healing of other nodes - which again may affect the healing of the originating node. Building an optimal mesh may require a multi-pass healing approach where nodes are healed multiple times during the process. Thankfully most mesh networks are self healing (to a degree), so they tend to iterate towards a more stable solution with time by themselves, although it may not be the best possible solution (which may be prohibitively expensive to find on large networks).

So even if the zwavejs healing is not optimal, the damage done to the mesh is probably only temporary and will fix itself after a while.

Oh certainly, zwavejs is a huge improvement over OZW. And ignoring the graph, the management UI that comes with zwavejs2mqtt is really good too.