70 Zwave device network stuck healing for over 24hrs

My understanding is that zwavejs came first, and zwavejs2mqtt runs on top of zwavejs to integrate it with MQTT and add the control panel, which makes it usable well beyond Home Assistant. Think of zwavejs like DOS and zwavejs2mqtt like Windows: one is built on top of the other. Since zwavejs is the “base”, it was probably easier for Home Assistant to integrate it as the “official” zwave integration, whereas zwavejs2mqtt is a community addon.

To make it more confusing, zwavejs2mqtt can be run in websocket mode with MQTT turned off (which is how I run it), and it then links to Home Assistant through the zwavejs integration. I run Home Assistant Container, so I don’t have the addons and had to use zwavejs2mqtt and do everything the hard way.

The advanced zwave configuration documentation does a decent job explaining the setup.

Just as a warning, the network map is entirely useless. While it looks nice, the information it shows is simply the node neighbor graph. That is completely unrelated to the actual routing and mesh topology used internally by zwave. The real route your data takes over the mesh can often be totally different from what you see on the graph. It can change on its own depending on signal noise levels, it can route over nodes that are not listed as neighbors in the graph and it can even be asymmetrical (send and receive can go over different routes).

That said, switching to zwavejs is totally worth it. I did it myself recently and now I wonder why I stuck with OpenZWave for so long. Oh and you should definitely use the UI that comes with zwavejs2mqtt.

The integration I am using is the default ZWave JS integration so I need to read up on migrating to ZWave 2 MQTT without a major re-do of the setup as I’ve already lost a few days on that due to issues.

While I haven’t used it in a long time, I do have this tool:

Maybe it is time to test it out again to see if it can help resolve my issues. Right now everything is working fine but I expect havoc to happen again :slight_smile:

ZWaveJS2MQTT. It’s all a bit confusing, I agree. The history is basically that the guy who made ZWave2MQTT, which was and still is unrelated to Home Assistant, previously used OpenZWave as a backend driver. Then OpenZWave went MIA and he switched to zwavejs. Thus ZWave2MQTT became ZWaveJS2MQTT.

This was before HA decided to switch to zwavejs as well. That switch was done in a rush, so some suboptimal decisions were taken. There are all kinds of different addons, etc., and a HA native UI is being developed instead of reusing the one already available in ZWaveJS2MQTT. But in the end, they all use the same zwavejs backend driver.

In fact, you can entirely remove all zwave related integrations from HA and only use ZWaveJS2MQTT on its own, which will then communicate with HA over MQTT. That’s what I ended up doing, as I’m trying to reduce dependencies on integrations in HA to an absolute minimum. So there are lots of different options.

Oh yeah, that will definitely show you the real routes and much more. Needs a bit of knowledge about the internal operations of the zwave protocol to use effectively though.

The network map shows the “preferred route”, which might not be the actual route a message takes in a mesh network. So, I wouldn’t say “entirely useless”, but take its information with a grain of salt. The information in the graph could be useful, especially the number of hops and the weaknesses in the network. Another thing I’ve found useful is the logs. There are two sets of them in zwavejs2mqtt, general and zwave, and they can tell you about problems with the devices.

Showing all the potential network paths would create a messy, unusable graph. Here’s a discussion where the developers of zwavejs2mqtt talk about how to represent the network and what they decided. They wanted to avoid a graph that looks like a “ball of wool”.

I’m sure that zwave toolbox device will tell you way more advanced diagnostic info than zwavejs2mqtt can though.

I tried running it this way and the websockets seemed quicker and more reliable than MQTT. With MQTT you have to rely on a broker, which adds an additional “middle man” for communication. I’m sure this makes sense depending on your use case. Are you using MQTT for things outside Home Assistant? If so, it definitely would make sense to do it this way. Otherwise I’m not sure what the concerns are around the dependency on the zwave integration.

But does it? I just dug into the sources for zwavejs2mqtt and zwavejs to find out.

In order to build the graph, zwavejs2mqtt only uses the node neighbor lists (see ZwaveGraph.vue). Hops and links between nodes are determined by looking at the first neighbor if multiple are present. That is almost certainly wrong, unless you have a node with only a single neighbor (and even then, it can be wrong, see below). When I do a neighbor update on my zwavejs, the returned lists are sequential node IDs. There is no information about LWR/NLWR (what you would call the preferred route), nor any information about signal levels.
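
To make the problem concrete, here is a minimal Python sketch of the "first neighbor" approach described above. It is not the actual ZwaveGraph.vue code; the function names, the assumption that the controller is node 1, and the neighbor lists are all hypothetical.

```python
from typing import Dict, List, Optional

CONTROLLER_ID = 1  # assumption: the controller is node 1


def first_neighbor_parent(node_id: int, neighbors: Dict[int, List[int]]) -> Optional[int]:
    """Pick a node's 'parent' the naive way: the first listed neighbor."""
    listed = neighbors.get(node_id, [])
    return listed[0] if listed else None


def naive_hops(node_id: int, neighbors: Dict[int, List[int]]) -> Optional[int]:
    """Follow first neighbors toward the controller and count the links.

    Returns None if the chain dead-ends or loops before reaching the controller.
    """
    hops = 0
    current = node_id
    seen = set()
    while current != CONTROLLER_ID:
        if current in seen:
            return None  # first-neighbor chain loops back on itself
        seen.add(current)
        parent = first_neighbor_parent(current, neighbors)
        if parent is None:
            return None  # no neighbors reported for this node
        current = parent
        hops += 1
    return hops


# Hypothetical neighbor lists, sorted by node ID as a neighbor update returns them.
example_neighbors = {2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [1, 4]}
```

With these made-up lists, `naive_hops(4, example_neighbors)` gives 3 (4 to 3 to 2 to 1), even though node 4 also lists node 5, which lists the controller directly. The result depends only on the order of node IDs, not on anything the protocol actually does.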

To create the neighbor lists, zwavejs calls the ZW_GetRoutingInfo method on the zwave API (see GetRoutingInfoResponse). The zwave API reference does not state any guarantee that the order of returned neighbors is in any way related to the APR/LWR/NLWR. It only states that (section 4.4.15):

ZW_GetRoutingInfo is a function that can be used to read out neighbor information from the protocol. This information can be used to ensure that all nodes have a sufficient number of neighbors and to ensure that the network is in fact one network.

In fact, the standard clearly defines when neighbor lists are used for routing, and that’s almost never (section 3.10.2):

The routing attempts done by a static controller to reach the destination node are as follows:

* If APR, LWR and NLWR all are non-existing and TRANSMIT_OPTION_ACK set. Try direct when neighbors with retries.
* If APR exist and TRANSMIT_OPTION_ACK set. Try the APR. If APR fails then try LWR if it exist and if it also fails then remove the LWR and try direct if neighbor.
* If APR do not exist, LWR exist and TRANSMIT_OPTION_ACK set. Try the LWR. In case the LWR fails, ‘exile’ it to become NLWR and try old NLWR if it exist. if the NLWR also fails, remove it and try direct if neighbor.
* If APR do not exist, LWR do not exist, NLWR exist and TRANSMIT_OPTION_ACK set. Try the NLWR. In case the NLWR fails remove it and try direct if neighbor.
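
The four cases quoted above can be written down as a small Python sketch. It only models the order of attempts when TRANSMIT_OPTION_ACK is set; the removal and "exile" of failed routes is left out, and the function name and string labels are made up for illustration.

```python
from typing import List


def route_attempts(apr: bool, lwr: bool, nlwr: bool, is_neighbor: bool) -> List[str]:
    """Order in which a static controller tries routes, per the quoted rules.

    Each flag says whether that route exists for the destination node;
    is_neighbor says whether the destination is a direct neighbor.
    """
    attempts: List[str] = []
    if apr:
        attempts.append("APR")
        if lwr:
            attempts.append("LWR")
    elif lwr:
        attempts.append("LWR")
        if nlwr:
            attempts.append("NLWR")
    elif nlwr:
        attempts.append("NLWR")
    # Direct transmission is the last resort, and only for neighbors.
    if is_neighbor:
        attempts.append("direct")
    return attempts
```

For example, `route_attempts(False, True, True, True)` returns `["LWR", "NLWR", "direct"]`. In every case where a stored route exists, the neighbor relationship only comes into play at the very end.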

So basically neighbors are only used as an ultimate fallback if everything else fails. And even then it will route through the neighbor with the strongest signal. And signal strength is not available to, or used by, zwavejs or the network graph either.

So in conclusion, what the zwavejs network graph displays has nothing to do with the routing. It’s completely random and should not be relied upon.

Yeah that discussion made me smile. How to display random data in a pretty way :grin:

About the MQTT, there are pros and cons. I’ll reply to this later, I’m on the go right now.

Yeah I agree, there’s definitely way more to it than just pulling the first neighbor and making a graph out of it. Zwave routing is a complex topic and it looks like the developers chose to just oversimplify it to make a “pretty” graph.

I guess the question is, can you even make a reliable routing graph with the information that would be available to a zwave management program like zwavejs2mqtt? The key to determining routing seems to be the “routing table” stored on the controller, based on my understanding and this great article: Understanding Z-Wave Networks, Nodes & Devices – Vesternet. It seems some information on the zwave protocol is proprietary, and seeing the routing tables stored on the controller relies on using Silicon Labs’ own PC Controller software. I’ve used that program to update firmware and would say it’s far from “user friendly”.

Thanks for the info and definitely interested to hear your thoughts on MQTT vs Websockets.

This thread is turning into a Wikipedia page on ZWave! Interesting read, thanks guys! Great stuff to learn about!

This is not a routing graph.

The network graph is more than useless. It has completely wrong information on it, and caused me to do a lot of damage to a perfectly well functioning zwave network when it said devices were routing through nodes they actually weren’t. Even the neighbor info is generally wrong.

I filed a bug on the zwavejs2mqtt git repo suggesting the network graph be completely removed. It does much more harm than good right now.

Yes, from the additional info and context from @HeyImAlex above I realize that now

Do you have a link? I couldn’t find it on there. I’d be interested in how the developers respond.

Not sure why the developers don’t understand how bad it is.

If it were a simplified representation then that would be fine. But it’s completely wrong and misleading. And that’s a problem.

Not really, but you can make something more useful. Getting the precise routing requires a sniffer like @aruffell linked to. But even that just gives you a momentary view of the network state and is more useful for deep protocol debugging than for general usage. As far as I see it, it doesn’t make any sense to try representing a mesh network as a fixed hierarchical graph, because that’s not what it is. Mesh networks are mutable by definition and should be represented differently.

Do we really care about exactly what route a message took? Not really, but we care about things like node congestion, weak paths or isolated nodes. These problems can be measured and represented as statistics over time. Basically collect network stats and aggregated routes over a period of say 24 hours. The representation should then display node connectivity as a type of traffic network. Heavily used routes could be represented as thick arrows, rarely used fallback routes as thin ones. Average signal strength between nodes should be shown. That way you can instantly see where the ‘data highways’ are and which nodes get congested because they end up being the sole router for half of your house. You can also see nodes that are struggling to relay their messages due to a weak mesh. A top-down hierarchical graph with a single connection per node is just wrong and making it pretty won’t help.
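
As a rough idea of what such an aggregation could look like, here is a Python sketch. It assumes you somehow have a log of observed routes with a per-hop RSSI reading (for instance from a sniffer); the data format, the names and the sample values are all hypothetical, and none of this is an existing zwavejs feature.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

# One observation: the node IDs a frame travelled through, plus one RSSI
# value (dBm) per hop. Both are assumed to come from some capture source.
Observation = Tuple[List[int], List[float]]


@dataclass
class LinkStats:
    frames: int = 0
    rssi_sum: float = 0.0

    @property
    def avg_rssi(self) -> float:
        return self.rssi_sum / self.frames


def aggregate_links(observations: List[Observation]) -> Dict[Tuple[int, int], LinkStats]:
    """Count frames and average RSSI per directed link (arrow thickness and label)."""
    links: Dict[Tuple[int, int], LinkStats] = defaultdict(LinkStats)
    for route, rssi_per_hop in observations:
        hops = zip(route, route[1:])
        for (src, dst), dbm in zip(hops, rssi_per_hop):
            stats = links[(src, dst)]
            stats.frames += 1
            stats.rssi_sum += dbm
    return dict(links)


def relay_load(observations: List[Observation]) -> Dict[int, int]:
    """Count how many frames each node repeated (source and destination excluded)."""
    load: Dict[int, int] = defaultdict(int)
    for route, _ in observations:
        for node in route[1:-1]:
            load[node] += 1
    return dict(load)


# Made-up sample: three frames relayed by node 3, one fallback via node 4.
sample = [
    ([5, 3, 1], [-70, -60]),
    ([5, 3, 1], [-74, -62]),
    ([6, 3, 1], [-80, -64]),
    ([5, 4, 1], [-90, -85]),
]
```

On the sample data, `relay_load(sample)` shows node 3 repeating three frames and node 4 one, and the 5 to 3 link averages -72 dBm over two frames. Those counts and averages are the kind of numbers a traffic-style view could draw from.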

Now it’s true that a lot of that information cannot be gathered by zwavejs on the primary controller. I don’t know if there’s even a way to query the nodes for their respective LWR and NLWR. The controller might not even be aware of certain node routes due to exploratory frames. A zwave sniffer would be needed for that. But what zwavejs could do is make use of signal RSSI. As far as I know, this info is available over the zwave API. The specs are also more or less explicit about how they establish routes based on signal dBm. While there probably are some proprietary details left out, it could be used to display more useful information. Like identifying nodes with weak connections or giving some best effort guesses for the most likely routes taken. Anything would be better than the current graph.

On another note, digging around the zwavejs source I found out why network healing doesn’t always work as expected and sometimes leaves the network in a worse state than before. zwavejs does the healing by using the same flawed neighbor metric as the graph. It establishes a hierarchical connectivity graph based on neighbors (which is almost guaranteed to be wrong) and then starts the healing from the bottom up. The problem with this is that since the graph is wrong, the healing order will not be optimal (read: it will be random) and some nodes will be healed in the wrong order, missing potential repeater nodes in the process.
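
To illustrate how the ordering depends on the neighbor lists, here is a hypothetical Python sketch, not the zwavejs implementation. It assumes the hierarchy is a breadth-first hop count from the controller (taken to be node 1) and that nodes closest to the controller are healed first; the post above does not spell out the direction, so treat that as a guess.

```python
from collections import deque
from typing import Dict, List


def heal_order(neighbors: Dict[int, List[int]], controller: int = 1) -> List[int]:
    """Order nodes for healing by hop distance derived from neighbor lists only."""
    depth = {controller: 0}
    queue = deque([controller])
    while queue:
        node = queue.popleft()
        for nb in neighbors.get(node, []):
            if nb not in depth:
                depth[nb] = depth[node] + 1
                queue.append(nb)
    nodes = [n for n in neighbors if n != controller]
    # Nodes that cannot be reached through the neighbor lists go last.
    return sorted(nodes, key=lambda n: (depth.get(n, float("inf")), n))


# Made-up lists: the second one has node 3 reporting a stale neighbor list.
good = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
stale = {1: [2], 2: [1, 4], 3: [4], 4: [2, 3]}
```

`heal_order(good)` gives `[2, 3, 4]`, while `heal_order(stale)` gives `[2, 4, 3]`: the same physical network is healed in a different order purely because one neighbor list is out of date.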

I mean it’s better than nothing for sure. But a better connectivity metric using signal levels should definitely be high up on their backlog priority.

This is more concerning than just a bad graph, and I’ve noticed a heal usually “temporarily” makes things worse for me. I chalked it up to my mix of older and plus devices, but things always stabilize within a few days so I assume the network figures itself out. Healing individual troublesome nodes, and the ones around them that show issues in the logs, has worked out much better for me than healing the whole network, and I haven’t done a network wide heal in a while.

Have you commented on their github at all? Maybe even on the discussion linked by @fresnoboy? The developer didn’t seem to entertain the idea of removing or changing the graph at all. Maybe hearing from more users could help them understand the concerns around the graph/heal? I’d be happy to add comments but I feel you have a much more solid grasp of the protocol and issues than I do to get the ball rolling again.

No and honestly, I don’t plan to. The problem with issues like this is that they will always end in a battle of egos where technical argumentation is mostly ignored. I mean, his response ‘I see no reason why we should remove it, not all feedbacks about it are so bad’ pretty much shows this is a dead end. I’ve had my share of trying to argue with developers over illogical issues like that (cough multifloor support in Valetudo cough) and I don’t want to get involved in that again. I’ve had a few PRs that fixed bugs like that rejected before purely because reasons (read: I hurt someone’s feelings). These days I just pull the source, modify what I think is broken for me and that’s it.

Everything in my home runs on MQTT, it’s the common backend for all of my smart devices. All except zwave, until I recently switched to zwavejs2mqtt. Reliability of well configured MQTT is extremely good. Performance wise, websockets are probably faster. But when it comes down to it, the biggest bottleneck in both cases is HA. zwavejs is super fast, Mosquitto is blazingly fast and HA is, well, not that fast :wink: But I never really noticed any lag, as long as you don’t overload the broker. HA using discoverable MQTT topics tends to do that from time to time, because of the sheer number of topics it subscribes to (and entities it creates). I use manually defined MQTT entities, so I don’t have this issue. And of course it depends on your CPU too.

The reason I do it this way rather than using the official zwave integration is mostly around my own philosophy. I don’t really agree with HA’s philosophy of being an all in one blackbox abstracting heterogeneous smart home devices. Because you end up becoming dependent on the whims of some HA integration developer, breaking changes and whatnot. I prefer having a completely open MQTT based backend and only use HA as a web frontend basically. I even started transitioning my automations out of HA. I like independent and easily replaceable subsystems. If needed, I could even replace HA with something else rather painlessly. Not that I want to (yet), HA is great :slightly_smiling_face: But I like having options.

But yeah, for most users the websockets implementation is probably better.

I use the websockets implementation and run zwavejs2mqtt (and deconz for that matter) on a Pi and not on my HA machine. This has a lot of advantages:

  1. I can run HA in a guest VM and be able to migrate it to other hosts in my cluster and not take any downtime, because there is no USB passthrough etc… to deal with.

  2. I can put the Pi in the center of the home or wherever the best location is for those devices, and I can run multiples of them if I need to, and not have to worry about where HA runs.

  3. I can change the versions of zwavejs2mqtt and deconz independently from the native integrations on HA. HA’s testing is not always robust (which is why we have a cascade of point releases after each month’s new version is released).

Now, on the Zwave point, I came from Homeseer which had pretty decent zwave support, but my lot and home are fairly big and I had issues with some of the devices from a reliability POV. I made the move to zigbee for everything except locks and some dome valves that control water and gas supplies. Zigbee has been phenomenally reliable compared with zwave, but the smaller network of mostly new zwave devices was reliable.

Reliable, that is, until after I imported things into HA and then started trying to fix the things the network graph was showing me. Only after I realized the data is totally useless did I stop and manage to get things stabilized.

I have to say, I can’t recommend HA to anyone who has a major zwave investment - the support is really pretty poor, and the network graph is just totally wrong. It’s worse than not having anything at all. But like you, I can’t understand why a developer would answer like that and not address how broken it is.

I wonder if Paulus or any of the principals understand how bad it is.

Your three points are very valid, there can indeed be good reasons to keep zwavejs on an entirely separate system. Note that this also works with MQTT (obviously). About your third point, you can also keep both zwavejs and HA separate on the same system (and update / maintain them independently of each other) if you install HA core (venv) or HA container (Docker).

I agree that zwave support is pretty poor on HA compared to some commercial systems. But commercial systems will always have an advantage in that respect. They have paid developers, can license commercial zwave stacks and have a financial interest in not adding misleading features that people could use to ruin their setup and then proceed to ask for refunds or litigate. HA is all volunteers (with all the good and bad things this entails) and zwave is far from being an easy protocol to write a driver for. AlCalzone did a tremendous job with zwavejs, especially considering it’s open source and he does it in his free time. HA zwave support has also improved a lot over the years, especially since they migrated to zwavejs. There are still some (very) rough corners, but I’m sure they will eventually be ironed out.