Z-Wave status not updating for lights

I’m not sure; none of my devices are like that, and none of them change their configuration like that. It might be worth looking in your device manual to see if there’s another parameter associated with that one.

I’d start by setting those all to 5 minutes and see if you can get the network stable.
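If you want to batch that change, the zwave_js.set_config_parameter service can do it from HA. A rough, untested sketch over the REST API; the URL, token, entity IDs and parameter number are placeholders, and the correct parameter number (and whether the value is in seconds) comes from your device manual:

# Bulk-set a power-reporting interval via the HA REST API.
# The parameter number and entity IDs below are hypothetical.
import requests

HASS_URL = "http://homeassistant.local:8123"   # adjust to your install
HEADERS = {"Authorization": "Bearer YOUR_LONG_LIVED_ACCESS_TOKEN"}

ENTITIES = ["switch.hallway_lights", "switch.master_bedroom_lights"]

for entity_id in ENTITIES:
    resp = requests.post(
        f"{HASS_URL}/api/services/zwave_js/set_config_parameter",
        headers=HEADERS,
        json={
            "entity_id": entity_id,
            "parameter": 40,   # hypothetical: your device's report-interval parameter
            "value": 300,      # 5 minutes, assuming the device expects seconds
        },
        timeout=10,
    )
    resp.raise_for_status()
    print(entity_id, "->", resp.status_code)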

That websocket error means the system is sending too much data for the UI to keep up. Most likely the HASS system CPU is maxing out from too much data coming in.

Likely the switch state messages are getting dropped by the device since it’s too busy trying to send the power data.

It can be anything that uses websockets, not just the UI.

Yes, good point, likely the WS connection to zwavejs. Those error messages should include the endpoint to aid in diagnosing.

I kind of figured there was something along these lines, some issue on the z-wave side swamping the websocket. But I use node-red as well, so I guess that could also be the cause, not just the innocent occasional victim. They don’t make it easy…

Based on what I see in your Z-Wave logs, I would assume it’s not Z-Wave.

For me, PeteRage’s suggestion worked. I updated the firmware on those devices and they started responding appropriately again.

Ok, cheers for that. Will have to play around a bit.

IMO, I don’t think that’s your problem. I think you have a node red automation that’s firing too frequently.

Yeah, OK, that’s got me thinking. Can you suggest a way to debug that?

I’m not sure, I don’t use Node Red. But your zwave network isn’t that chatty, so I don’t think it’s zwave.

The more I look at it, the more it seems it has to be a z-wave issue. I’ve disabled node-red (stopped the container completely) and it has made little difference. I have MQTT disabled in zwavejsui, so I’m thinking it’s related to the web socket, or the network itself. (But I’m convinced any ambiguous web socket errors I’ve seen in the past in HA are related to the z-wave issues, not causing them.)

My zigbee devices via zigbee2mqtt are near instantaneous. These use a USB stick sharing a connection to my Synology NAS via a 2-port USB hub. That arrangement has been happy for many months, well predating the recent issues, so I’m confident they don’t relate to that.

The z-wave network seems to have settled down a lot lately (since my original post, through no real action from me, and before disabling node-red just now), and the latency is now only about 5-10 seconds between operating a physical switch and seeing some noise in the zwavejsui logs. But obviously that’s still not really palatable.

The only other thing I can note, with my level of knowledge and investigatory skills, is that most nodes seem to be routing through two particular nodes, for reasons I can’t even speculate on… those nodes aren’t significantly closer geographically to the controller than the others, don’t have any special security inclusions, etc., so I have no idea why everyone else chooses to hop through them.

But, following the possibility that they are somehow the culprits, I (almost humorously) watched the logs while switching those two devices and the one node closest to me that I’ve been doing all my previous observations against:

  • Log turned on
  • Hallway lights switched on - no log activity - turned off - no log activity
  • Master Bedroom lights switched on - no log activity - turned off - no log activity
  • Master Ensuite toilet lights turned on (one switch in a dual device) - logs activity within 5-10 seconds
  • Still nothing from Hallway or Master Bedroom…

I think I’ve updated the firmware for every device. Although it’s curious that I have two identical devices running different firmware versions, both saying they’re up to date… so maybe I have another issue there.

You can’t rely on the graph in ZwaveJS UI. It’s simply a pretty picture, it does not represent the routing tables. There’s no way to find out the routing tables in ZwaveJS at this time. Just ignore that whole graph.

Do you have a strong mesh? The more routing devices you have, the better the network behaves. Also, a heal network will optimize your routes, but it takes a long time and it’ll wait for battery devices to wake up.

How can I determine whether I have a strong mesh? I have 24 devices, mostly in-roof lighting controllers. My home is single storey and not large, so I would think it’s as good as it gets in practice.

I did initiate a heal after my previous post and can see that it’s still going. I’ve got a window sensor in a kids’ room I can’t wake for fear of waking someone, but I’ll let it do its thing and open that when I can. It’s coming back to me that I may have done a heal a while back, and that possibly coincides with the issue… to note also, I have 3 or 4 wired devices currently disconnected entirely, such as node 8 here. Is there a chance that could undermine the whole heal process, and might that have been what caused my woes?

Thanks for all your help petro.

Yes, all devices should be available during a heal


Take a look at the resource utilization both on the NAS and the docker containers.
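If the docker SDK for Python happens to be available on the NAS, a quick sketch like this can spot-check the containers (the container names are guesses for this setup; docker stats from a shell gives the same numbers):

# Spot-check CPU/memory of the HA-related containers via the docker SDK.
import docker

client = docker.from_env()
for name in ("homeassistant", "zwavejs2mqtt", "nodered"):  # hypothetical names
    stats = client.containers.get(name).stats(stream=False)
    pre = stats.get("precpu_stats", {})
    cpu_delta = (stats["cpu_stats"]["cpu_usage"]["total_usage"]
                 - pre.get("cpu_usage", {}).get("total_usage", 0))
    sys_delta = (stats["cpu_stats"].get("system_cpu_usage", 0)
                 - pre.get("system_cpu_usage", 0))
    ncpu = stats["cpu_stats"].get("online_cpus", 1)
    pct = cpu_delta / sys_delta * ncpu * 100 if sys_delta > 0 else 0.0
    mem_mib = stats["memory_stats"]["usage"] / 2**20
    print(f"{name}: cpu {pct:.1f}%  mem {mem_mib:.0f} MiB")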

Since it’s taking 5-10 seconds for a message to appear in the zwave log, we need to find what is slowing it down, so we need to go step by step. I do think it is valuable to do heals one at a time on the mains-powered zwave devices. Once that is done, shut down the NAS, disconnect power, disconnect the USB hub, pull the zwave stick, and start the NAS back up. Sometimes the USB drivers, Linux, and docker stuff just get fouled up. While you do this, also make sure the stick is on an extension cable; I found 3’ to be the right length and 5’ to be too long. Get it located away from the NAS. Then look at these items:

  • CPU / memory utilization of the NAS, container, and disk: at steady state you are aiming for <10%.
  • Try setting the switches from HA and time how long it takes for the physical switch to turn on (a timing sketch for the feedback side follows this list). This should be very quick. Is it the same for all devices?
  • You may have a bad mains device that is introducing latency and that should be shown by the prior test. Power cycling all of them could be done proactively, or toss the whole house breaker, or work through them 1 by 1.
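On the second item, a rough way to put numbers on the feedback side (how long until HA reports the new state, as opposed to watching the light itself) is to poll the REST API; the URL, token and entity ID here are placeholders:

# Time how long it takes from sending a turn_on command until HA
# reports the entity as on.
import time
import requests

HASS_URL = "http://homeassistant.local:8123"
HEADERS = {"Authorization": "Bearer YOUR_LONG_LIVED_ACCESS_TOKEN"}
ENTITY = "light.master_bedroom"   # hypothetical

start = time.monotonic()
requests.post(
    f"{HASS_URL}/api/services/light/turn_on",
    headers=HEADERS,
    json={"entity_id": ENTITY},
    timeout=10,
).raise_for_status()

# Poll the state machine until the entity reports "on". This measures
# HA's view of the state, which for zwave_js updates on the device's report.
while True:
    state = requests.get(
        f"{HASS_URL}/api/states/{ENTITY}", headers=HEADERS, timeout=10
    ).json()["state"]
    if state == "on":
        break
    time.sleep(0.1)

print(f"{ENTITY}: reported on after {time.monotonic() - start:.2f}s")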

The battery devices don’t need a heal yet. To abort the current heal, just restart zwavejs. Then go through and heal the mains devices one by one; since they are online, this goes fast, like 2 seconds per device.

Thanks guys. I had done the full reboot and then the heal last night but to ensure I’m following PeteRage’s order I’ve now just:

  • Healed mains devices (~20 of them) one by one
  • Restarted the NAS
  • Run around and operated each lighting device via the mechanical switch and HA app

The Z-Wave dongle (Zooz) and a ZigBee stick (ConBee II) share a 2-port hub at the end of a 3 m extension, run up into the roof space well away from the NAS and on the same level as most of the lighting devices installed up there.

NAS (DS920+) resource utilisation:

  • CPU 5%
  • Memory 18% (20GB with unofficial upgrade and can’t go beyond)

The results:

So is this pointing to certain devices being really ordinary? And overall Z-Wave latency isn’t great. I have noticed the Qubino dimmers in particular are temperamental in their feedback to HA, but I just don’t recall this sort of general latency on the feedback side.

I thought HA was “optimistic” in terms of showing the expected state while awaiting feedback, but maybe that’s not possible with dimming? Generally speaking, turning something off in HA was pretty much instantaneous both in action at the light, and in HA’s reflection of that… but HA’s reflection of the on state was generally laggy, and broadly variable across the three device types I have, and even within the two circuits of one dual switch device as noted!

The above possibly points at some poor z-wave devices, but one thing I can’t ignore is the websocket issue when HA restarts (after the NAS reboot I performed, for example), which I feel is only about as old as my issue and which I flagged earlier in this thread. I get:

Logger: homeassistant.components.websocket_api.http.connection
Source: components/websocket_api/http.py:157
Integration: Home Assistant WebSocket API (documentation, issues)
First occurred: 9:51:50 PM (1 occurrences)
Last logged: 9:51:50 PM

[140257480236272] Client unable to keep up with pending messages. Stayed over 512 for 5 seconds. The system's load is too high or an integration is misbehaving

The Node-RED container is stopped, so for mine it’s zwavejsui or something else. And I can guess that if it’s zwavejsui, it is feasible that that may be causing the latency…?

That rules out a bunch of stuff. I have a 920+ also.

It’s good to know the system can work fast when sending commands to devices.

I believe Lovelace keeps the set state for a second then reverts unless it gets an update.
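If you want to see exactly when the real update lands (versus Lovelace’s optimistic flip), something like this against the websocket API will timestamp state_changed events. A rough sketch; the URL and token are placeholders, and it needs the websockets package:

# Timestamp state_changed events from the HA websocket API.
import asyncio
import json
import time

import websockets

URL = "ws://homeassistant.local:8123/api/websocket"
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"

async def watch() -> None:
    async with websockets.connect(URL) as ws:
        await ws.recv()                                   # auth_required
        await ws.send(json.dumps({"type": "auth", "access_token": TOKEN}))
        await ws.recv()                                   # auth_ok
        await ws.send(json.dumps(
            {"id": 1, "type": "subscribe_events", "event_type": "state_changed"}
        ))
        while True:
            msg = json.loads(await ws.recv())
            if msg.get("type") != "event":
                continue                                  # skip the subscribe result
            data = msg["event"]["data"]
            if data.get("new_state") and data["entity_id"].startswith("light."):
                print(time.strftime("%H:%M:%S"),
                      data["entity_id"], "->", data["new_state"]["state"])

asyncio.run(watch())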

I agree we should chase that websocket thing, as that is not normal. Try turning debug logging on in configuration.yaml:

logger:
  default: debug

Then restart. Let’s see if we get any additional log entries regarding that message that help figure out what it is.

I don’t like that those three devices either never report or take a minute to do so. Do the zwavejs stats look OK?

Are you running any custom integrations? Any changes to the zigbee network recently?

I’m looking at that websocket code:

    @callback
    def _check_write_peak(self, _utc_time: dt.datetime) -> None:
        """Check that we are no longer above the write peak."""
        self._peak_checker_unsub = None

        if self._to_write.qsize() < PENDING_MSG_PEAK:
            return

        self._logger.error(
            "Client unable to keep up with pending messages. Stayed over %s for %s seconds. "
            "The system's load is too high or an integration is misbehaving",
            PENDING_MSG_PEAK,
            PENDING_MSG_PEAK_TIME,
        )
        self._cancel()

If the queue is still over the peak, it then calls _cancel(), which kills the connection and also loses any messages that have not been processed. (PENDING_MSG_PEAK and PENDING_MSG_PEAK_TIME are the 512 and 5 seconds in the error message.)

Let’s try this for logging first, as it will focus just on this piece of code. It does have debug output for messages sent and received.

logger:
  default: info
  logs:
    homeassistant.components.websocket_api.http.connection: debug