70 Zwave device network stuck healing for over 24hrs

mwav3 · August 24, 2021, 7:59pm

This is more concerning than just a bad graph, and I’ve noticed a heal usually “temporarily” makes things worse for me. I chalked it up to my mix of older and Plus devices, but things always stabilize within a few days, so I assume the network figures itself out. Healing individual troublesome nodes, and the ones around them that show issues in the logs, has worked out much better for me than healing the whole network, and I haven’t done a network-wide heal in a while.

Have you commented on their GitHub at all? Maybe even on the discussion linked by @fresnoboy? The developer didn’t seem to entertain the idea of removing or changing the graph at all. Maybe hearing from more users could help them understand the concerns around the graph/heal? I’d be happy to add comments, but I feel you have a much more solid grasp of the protocol and issues than I do to get the ball rolling again.

HeyImAlex · August 24, 2021, 8:23pm

No, and honestly, I don’t plan to. The problem with issues like this is that they will always end in a battle of egos where technical argumentation is mostly ignored. I mean, his response ‘I see no reason why we should remove it, not all feedbacks about it are so bad’ pretty much shows this is a dead end. I’ve had my share of trying to argue with developers over illogical issues like that (cough multifloor support in Valetudo cough) and I don’t want to get involved in that again. I’ve had a few PRs that fixed bugs like that rejected before, purely because reasons (read: I hurt someone’s feelings). These days I just pull the source, modify what I think is broken for me, and that’s it.

HeyImAlex · August 24, 2021, 9:20pm

Everything in my home runs on MQTT; it’s the common backend for all of my smart devices. All except zwave, until I recently switched to zwavejs2mqtt. The reliability of a well-configured MQTT setup is extremely good. Performance-wise, websockets are probably faster. But when it comes down to it, the biggest bottleneck in both cases is HA. zwavejs is super fast, Mosquitto is blazingly fast and HA is, well, not that fast. But I never really noticed any lag, as long as you don’t overload the broker. HA using discoverable MQTT topics tends to do that from time to time, because of the sheer number of topics it subscribes to (and entities it creates). I use manually defined MQTT entities, so I don’t have this issue. And of course it depends on your CPU too.

The reason I do it this way rather than using the official zwave integration is mostly down to my own philosophy. I don’t really agree with HA’s philosophy of being an all-in-one black box abstracting heterogeneous smart home devices, because you end up becoming dependent on the whims of some HA integration developer, breaking changes and whatnot. I prefer having a completely open MQTT-based backend and only use HA as a web frontend, basically. I even started transitioning my automations out of HA. I like independent and easily replaceable subsystems. If needed, I could even replace HA with something else rather painlessly. Not that I want to (yet), HA is great. But I like having options.

But yeah, for most users the websockets implementation is probably better.
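To make the point about topic counts a bit more concrete, here is a minimal sketch of how MQTT topic filters select messages. The matcher follows the standard `+` (single-level) and `#` (multi-level) wildcard rules but deliberately ignores the special handling of `$`-prefixed topics. All topic names below are hypothetical examples, not taken from anyone’s actual setup.

```python
def topic_matches(topic_filter, topic):
    """Return True if an MQTT topic filter matches a topic name.

    Simplified: handles '+' and '#', but not the rule that wildcards
    must not match topics starting with '$'.
    """
    filter_parts = topic_filter.split("/")
    topic_parts = topic.split("/")
    for i, part in enumerate(filter_parts):
        if part == "#":
            # Multi-level wildcard: matches everything from here on.
            return True
        if i >= len(topic_parts):
            return False
        if part != "+" and part != topic_parts[i]:
            return False
    # Without '#', both must have the same number of levels.
    return len(filter_parts) == len(topic_parts)


# Hypothetical topics that a broker might be carrying.
topics = [
    "homeassistant/light/kitchen/config",
    "homeassistant/sensor/hall_temp/config",
    "zwave/livingroom/dimmer/38/0/targetValue",
    "zwave/hall/lock/98/0/currentMode",
]


def count_matches(topic_filter):
    """Count how many of the example topics a filter would receive."""
    return sum(topic_matches(topic_filter, t) for t in topics)


broad = count_matches("#")
discovery = count_matches("homeassistant/+/+/config")
manual = count_matches("zwave/hall/lock/98/0/currentMode")
```

A client that subscribes with broad wildcards receives every matching message, while a manually defined entity only listens to the exact topics it was given.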

fresnoboy · August 24, 2021, 9:54pm

I use the websockets implementation and run zwavejs2mqtt (and deconz for that matter) on a Pi and not on my HA machine. This has a lot of advantages:

1. I can run HA in a guest VM and migrate it to other hosts in my cluster without taking any downtime, because there is no USB passthrough etc. to deal with.
2. I can put the Pi in the center of the home or wherever the best location is for those devices, and I can run multiples of them if I need to, and not have to worry about where HA runs.
3. I can change the versions of zwavejs2mqtt and deconz independently from the native integrations on HA. HA’s testing is not always robust (which is why we have a cascade of point releases after each month’s new version is released).

Now, on the Zwave point, I came from Homeseer which had pretty decent zwave support, but my lot and home are fairly big and I had issues with some of the devices from a reliability POV. I made the move to zigbee for everything except locks and some dome valves that control water and gas supplies. Zigbee has been phenomenally reliable compared with zwave, but the smaller network of mostly new zwave devices was reliable.

Reliable, that is, until after I imported things into HA as well and then started trying to fix the things the network graph was showing me. Only after I realized the data is totally useless did I stop and manage to get things stabilized.

I have to say, I can’t recommend HA to anyone who has a major zwave investment - the support is really pretty poor, and the network graph is just totally wrong. It’s worse than not having anything at all. But like you, I can’t understand why a developer would answer like that and not address how broken it is.

I wonder if Paulus or any of the principals understand how bad it is.

HeyImAlex · August 24, 2021, 10:06pm

Your three points are very valid, there can indeed be good reasons to keep zwavejs on an entirely separate system. Note that this also works with MQTT (obviously). About your third point, you can also keep both zwavejs and HA separate on the same system (and update / maintain them independently of each other) if you install HA core (venv) or HA container (Docker).

I agree that zwave support is pretty poor on HA compared to some commercial systems. But commercial systems will always have an advantage in that respect. They have paid developers, can license commercial zwave stacks and have a financial interest in not adding misleading features that people could use to ruin their setup and then proceed to ask for refunds or litigate. HA is all volunteers (with all the good and bad things this entails) and zwave is far from being an easy protocol to write a driver for. AlCalzone did a tremendous job with zwavejs, especially considering it’s open source and he does it in his free time. HA zwave support has also improved a lot over the years, especially since they migrated to zwavejs. There are still some (very) rough corners, but I’m sure they will eventually be ironed out.

petro · August 24, 2021, 10:18pm

It’s a custom addon, not official. Zwave JS is the official addon and it does not have the graph. Keep in mind that community addons are not official addons.

Zwavejs2mqtt is managed by people outside HA.

fresnoboy · August 24, 2021, 11:18pm

Thanks for pointing this out. I had no idea. I thought the Zwave part was the same and that the addon part was just that it runs separately.

This explains a lot actually. Not sure how to fix it though.

And the heal nodes function, is that in the core ZwaveJS part?

petro · August 24, 2021, 11:28pm

It’s all ZwaveJS stuff. The only things that Home Assistant manages are the integration and its UI, and the official addon. The official addon is a small Python wrapper for the zwave js server.

mwav3 · August 25, 2021, 12:14am

This definitely makes a lot of sense. I use zigbee2mqtt, which doesn’t have websockets (yet at least), and because it’s MQTT I have Node-RED flows that interact with it that are completely outside Home Assistant. That wouldn’t be possible with a Home Assistant websocket-based integration. MQTT definitely leaves more flexibility to work with other things, even if websockets might be quicker for Home Assistant.

I only have 22 zwave devices so switching them down the road to mqtt wouldn’t be a big deal if it became a concern in the future, but if you have many devices, it would be more of a pain to reconfigure everything.

Getting back to the heal and network graph issue, I understand the frustration with going up against developers’ egos while trying to get things working. I think this is a big enough issue, though, that I’ll try to do a bit more research and file an issue on the zwavejs GitHub, ideally with some suggestion to fix it. A network heal shouldn’t make things worse, and clearly you identified an issue in the code where there is a problem. I experienced bad heals first hand, and it sounds like @aruffell experienced it too, which started this whole post.

I will say, though, that despite this heal/graph issue, OpenZwave was terrible, and the improved zwavejs2mqtt coming out was what finally pushed me to switch everything to Home Assistant. I’m still very happy with it. They are actively developing it and I see it improve with each release. It also remains way better than a cloud-based SmartThings hub, although I’m sure better commercial zwave options exist. Hopefully they can get this sorted out.

HeyImAlex · August 25, 2021, 1:25am

I agree that this is a relatively serious problem, but I’m afraid there’s no easy fix for it. At least not until zwavejs implements a different internal structure representing the mesh topology in a stochastic manner based on signal strength. Instead of using fixed neighbors, the mesh would be represented by probabilities that a node may route through another. I remember reading somewhere that something like that was on their todo list, but it doesn’t seem very high priority at this time.
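To illustrate what a probability-based view of the mesh could look like, here is a rough sketch. It is not how zwavejs works or plans to work. The logistic mapping from RSSI to delivery probability, its parameters, the node names and the RSSI values are all assumptions made up for this example. Routes are ranked by the product of their link probabilities, found with Dijkstra’s algorithm on `-log(p)` edge weights.

```python
import heapq
import math


def link_probability(rssi_dbm, midpoint=-85.0, steepness=0.25):
    """Map an RSSI reading (dBm) to a delivery probability in (0, 1).

    The logistic shape and both parameters are illustrative assumptions.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (rssi_dbm - midpoint)))


def most_reliable_route(rssi, source, target):
    """Return (probability, path) for the route with the highest
    end-to-end delivery probability, or (0.0, []) if none exists."""
    # Build an undirected graph whose edge cost is -log(link probability),
    # so that the cheapest path is the most probable one.
    graph = {}
    for (a, b), dbm in rssi.items():
        cost = -math.log(link_probability(dbm))
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    best = {source: 0.0}
    queue = [(0.0, source, [source])]
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return math.exp(-cost), path
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor, path + [neighbor]))
    return 0.0, []


# Hypothetical RSSI measurements between pairs of nodes.
rssi = {
    ("controller", "switch"): -60,
    ("switch", "lock"): -70,
    ("controller", "lock"): -95,
}
probability, path = most_reliable_route(rssi, "controller", "lock")
direct = link_probability(rssi[("controller", "lock")])
```

With these made-up numbers the two-hop route through the switch comes out at about 0.98, against about 0.08 for the weak direct link, so a graph drawn from fixed neighbor lists would hide exactly the information that matters.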

Actually, network healing is a pretty complex and non-trivial task. Earlier this year I designed and implemented my own DIY long-range mesh network protocol from scratch, operating on the 433MHz band, just for kicks. It was a very interesting learning experience and really showed me the subtleties and dark corners of mesh networking (like messages getting stuck in an infinite repeater loop for days before I noticed). Healing is non-trivial because the healing of one node may propagate its effects over the entire network and affect the healing of other nodes, which again may affect the healing of the originating node. Building an optimal mesh may require a multi-pass healing approach where nodes are healed multiple times during the process. Thankfully, most mesh networks are self-healing (to a degree), so they tend to iterate towards a more stable solution with time by themselves, although it may not be the best possible solution (which may be prohibitively expensive to find on large networks).

So even if the zwavejs healing is not optimal, the damage done to the mesh is probably only temporary and will fix itself after a while.
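The multi-pass idea described above can be shown with a toy model. This is a simple distance-vector style relaxation, not the Z-Wave or zwavejs healing procedure: each “heal” lets a node recompute its hop count to the controller from whatever its neighbors currently report. The node names, the link table and the `max_passes` limit are hypothetical.

```python
import math


def heal_node(node, links, hops):
    """Recompute one node's hop count from its neighbors' current values."""
    best = min((hops[n] for n in links[node]), default=math.inf)
    return best + 1


def heal_network(links, controller, max_passes=10):
    """Heal every node in order, repeating until a full pass changes nothing.

    Returns (hops, passes), where passes includes the final pass that
    only confirmed the result was stable.
    """
    hops = {node: math.inf for node in links}
    hops[controller] = 0
    for passes in range(1, max_passes + 1):
        changed = False
        for node in links:
            if node == controller:
                continue
            new_hops = heal_node(node, links, hops)
            if new_hops != hops[node]:
                hops[node] = new_hops
                changed = True
        if not changed:
            return hops, passes
    return hops, max_passes


# Hypothetical chain: controller - switch - sensor - lock.
# The nodes are healed far-to-near, which is the worst order here.
links = {
    "lock": ["sensor"],
    "sensor": ["lock", "switch"],
    "switch": ["sensor", "controller"],
    "controller": ["switch"],
}
hops, passes = heal_network(links, "controller")
```

Because the far nodes are healed before the neighbors they depend on, a single pass leaves the lock and the sensor with no usable route at all. It takes three productive passes plus a fourth confirming pass to settle, which is the kind of order dependence that makes a one-shot network-wide heal look worse before it gets better.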

Oh certainly, zwavejs is a huge improvement over OZW. And ignoring the graph, the management UI that comes with zwavejs2mqtt is really good too.