Z-Wave JS frequently dropping connections to endpoints

Hello all - I’m a fairly new Home Assistant user, but I’ve been tinkering with home automation for a number of years via SmartThings. I moved to Home Assistant earlier this year and I’m experiencing some frustrations with my Z-Wave configuration.

Issue: I frequently find that some of my Z-Wave devices show as unavailable after a non-specific period of time. Unfortunately, I have not found any commonality in what causes the devices to drop offline. Devices will show as dead nodes (I usually discover this when they don’t respond to an automation routine); however, I can usually navigate to the device and press Ping, and the device will come back online. On occasion, the recently pinged device will show as active, but as soon as I try to interact with it, I get a message saying the device is not available. Most of the devices are Zooz switches and Kwikset door locks.

What I’ve tried: I’ve repositioned my HA server to try to rule out signal strength, with some positive impact. I have extension cables on the USB Z-Wave controller. I have completely rebuilt my Z-Wave network (removing all devices and adding them back).

I’ve noticed that when devices start to become unresponsive, I can usually restart the Z-Wave JS service and they’ll often respond again. If that doesn’t work, I’ll sometimes get them to come back online by rebooting HA entirely. If that doesn’t work either, I can usually get them back by physically interacting with the device (turning it off/on at the paddle switch, locking/unlocking the door, etc.).
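That restart could also be scripted so it doesn’t require digging through the UI each time. A minimal sketch for scripts.yaml, assuming the official Z-Wave JS add-on (I believe its slug is core_zwave_js - check your add-on’s URL to confirm):

restart_zwave_js:
  alias: Restart Z-Wave JS add-on
  sequence:
  # Restart the Z-Wave JS add-on via the Supervisor
  - service: hassio.addon_restart
    data:
      addon: core_zwave_js  # assumed slug of the official add-on
  mode: single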

What I’m Using: the Z-Wave JS service. HA is running on a Raspberry Pi 4 (Home Assistant 2022.8.3; Supervisor 2022.08.3; Operating System 8.4; Frontend 20220802.0 - latest). The USB Z-Wave stick is a Zooz Z-Wave Plus S2 (ZST10 700), updated to the latest firmware.

What Next: I’m certain there are additional troubleshooting steps/logs that would help paint a clearer picture, but I’m not entirely sure which would be best, so I suspect that’s the first thing I’ll need to follow up with. I’m hoping someone can opine on what exactly would be helpful to further diagnose the issue.

When did these issues start? I’ve noticed my Z-Wave devices have become less reliable since the 2022.8.5 (or maybe .6) update.

I’ve experienced this since I first started with HA (since about 2022.5 for me). I thought I had nipped it in the bud with the overhaul of the entire Z-Wave network, but it has since come back around. The network will remain strong for a number of days before devices start falling off. Healing the network doesn’t seem to do much (it may indeed be helping, just not terribly quickly… I’ve been known to be a bit impatient).

FWIW, I’m currently on 2022.8.3 - I haven’t installed the last couple of intra-month releases, as there hasn’t been much in the release notes that made me think the install was necessary.

I have a variety of Z-Wave devices, am running 2022.8.6, and I do not see this problem. The only nodes that don’t respond immediately are battery-powered nodes. I reboot my system automagically from an automation, using Garbage Collection (HACS) to set the days. I reboot once every 5 days.

Do you enroll all nodes securely? I read a post (which I can’t find now) describing problems with the 700-series controllers in secure mode.

- id: 'Reboot HA Core'
  alias: Reboot HA Core
  description: Reboot home assistant at time set by input datetime
  trigger:
  - platform: time
    at: input_datetime.reboot_ha_core
  condition:
  - condition: state
    entity_id: sensor.boot_hacore
    state: '0'
  action:
  - service: homeassistant.restart
  mode: single

Interesting thought. Assuming I’m reviewing the correct info, it looks like my devices are all over the map. Some are not enrolled with any level of security; others are “Highest Security: S0 Legacy” or “Highest Security: S2 Unauthenticated”, up to “Highest Security: S2 Authenticated”. Is there a way for me to force them to a higher level of security? What they have currently is whatever was set when I enrolled them.

RE reboot cadence: I have some automations that run over an extended period of time. For example, the automation that controls my exterior lighting begins about 40 minutes before sundown and wraps up about 8:30 AM each day. I manually rebooted my system last night (when one of my nodes went offline and I was unable to get it to respond with the other methods) and found this morning that my exterior lights were off because my automation didn’t resume (which I believe is expected behavior for mid-flight automations after a reboot). Generally speaking, I’d execute the reboot overnight, since that’s when it’s least likely to interrupt anything. Are you aware of any way to check the status of automations and resume them? Or should I break that automation into two parts to let me ‘plan’ for the reboot automation?
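In the meantime, I’m considering a small ‘resync’ automation that runs when HA starts and puts the exterior lights back into the expected state. A rough sketch (the switch entity is a placeholder, and it ignores my 40-minutes-before-sundown offset):

- id: 'Resync exterior lights after restart'
  alias: Resync exterior lights after restart
  description: Re-apply the expected exterior light state when HA starts
  trigger:
  - platform: homeassistant
    event: start
  condition:
  # Only relevant at night, when the lights should be on
  - condition: state
    entity_id: sun.sun
    state: below_horizon
  action:
  - service: switch.turn_on
    target:
      entity_id: switch.exterior_lights  # placeholder entity
  mode: single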

If I remember correctly, you need to set all nodes to the lowest security except for locks. Locks must be enrolled securely.

I try to make my automations reboot-proof. Basically, they will only fail if you reboot at the exact time the automation was supposed to run. My irrigation automation is the only one that breaks the rule.

- id: 'Front entry light on at sunset and off at sunrise'
  alias: Front Entry Light on/off
  description: 'Turn on the lights to keep the birds out of the entryway'
  trigger:
  - id: 'off'
    platform: sun
    event: sunrise
    offset: '00:15:00'
  - id: 'on'
    platform: sun
    event: sunset
    offset: '-00:15:00'
  condition: []
  action:
  - choose:
    - conditions:
      - "{{ trigger.id == 'on' }}"
      sequence:
      - service: switch.turn_on
        target:
          entity_id: switch.outside_front_entry_light
    - conditions:
      - "{{ trigger.id == 'off' }}"
      sequence:
      - service: switch.turn_off
        target:
          entity_id: switch.outside_front_entry_light
    default: []
  mode: single

I’m open to trying to change the security of the devices to see if that resolves the issue. I’ll have to look up how to do that (ideally without having to physically unpair and re-pair each device).

I came back from a meeting and found that the plug controlling power to my 3D printer was offline. I wasn’t able to grab a screenshot of the error message I was getting when trying to interact with the plug, but here’s what the device history looks like in the interface - if there’s a better way to share more info, please let me know. Prior to this screenshot, the node was online and connected from the point I turned it on this morning.

Interestingly, at the same time the 3D printer plug was reported as dead, so was the light switch in the same room (although I don’t believe proximity is responsible) - as before, pressing “Ping” on the light switch immediately brought the node back online and I was able to control the switch again.
[Screenshot 2022-08-22 165548: device history]

You are describing exactly the behavior of my Z-Wave network. It’s frustrating.

Are you using the same Z-Wave controller?

No, I have the Nortek GoControl Z-Wave/Zigbee stick.

Ahh… is that a 700 series?

No, it is a 500 series. It is all I could get at the time.

Hmm… I was hoping that the 500 series might be the path to resolving the issue. Perhaps I’ll hold off on migrating over (and buying a new USB stick) and see if there are any other suggestions. I have seen some automations in the Blueprint Exchange that ping a device as soon as its node shows up as dead and wake it back up. I’m going to test that out and see if it helps…
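The gist of those automations is something like the sketch below, using the node-status sensor and Ping button that Z-Wave JS exposes per device (note the node-status sensor is disabled by default, and the entity names here are placeholders):

- id: 'Ping dead Z-Wave node'
  alias: Ping dead Z-Wave node
  description: Press the Z-Wave JS Ping button when a node is reported dead
  trigger:
  - platform: state
    entity_id: sensor.printer_plug_node_status  # placeholder node-status sensor
    to: dead
  action:
  - service: button.press
    target:
      entity_id: button.printer_plug_ping  # placeholder Ping button entity
  mode: single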

To be fair, I have seen a few posts saying my stick isn’t very good, so it could be that 🤷

Is there any chance that you have a bad Z-Wave controller? I lost a 500-series USB stick about 4 years ago. It was erratic, like you have described.

The only way I know to test this is to buy another and transfer over all the nodes. Have you set the log level to Silly in the integration and looked for a problem there? It may give you some guidance.

Could the RPi have port problems?

I have had to unplug the dongle a couple of times to get Z-Wave back. It is as if the computer and the dongle quit talking to each other. When I unplugged the dongle, plugged it back in, and rebooted, everything worked.

I run a VM on Linux with Supervisor, so my setup is different from the RPi. Do you have another computer where you could run an HA virtual machine to test your dongle, to see if the problem is something on the RPi?

I am throwing out ideas to see what sticks in the hopes that one might help.

All great thoughts, thanks.

  1. Yes, it’s absolutely possible that the controller is flaking out/going bad. I’ve started to look at other options, but this will be a bit further down my list of troubleshooting steps, as I’d like to do a bit of research on what I’d buy to replace it.

  2. The Pi could also have a bad port, although I suspect that’s not the cause. I have swapped the USB sticks around (I have one for Z-Wave and one for Zigbee) and they seem to continue functioning fine - the Zigbee network has (by and large) been solid, and I’ve tested it running in the same USB port.

  3. I will try turning the log level to Silly. One question: once I do that, I presume I’d need to let the system run for a period of time to then be able to review the log, correct? Is the log saved automatically?

  4. I unplugged the dongle while the HA server was powered off (and unplugged - apparently just shutting it down doesn’t necessarily cut all the power). It seemed to help; HOWEVER, the issues seem to return after a period of time…

  5. I do have another machine; however, I’m not familiar with running virtual machines. I suspect this will be a step to try after testing a new Z-Wave stick.

Correct. No, the log is not saved automatically, and it will clear if you leave the integration configuration page.

I just see each communication between the nodes and the controller:

2022-08-23T17:48:42.481Z SERIAL « 0x01090004002403250300f3                                            (11 bytes)
2022-08-23T17:48:42.483Z CNTRLR   [Node 036] [~] [Binary Switch] currentValue: false => false       [Endpoint 0]
2022-08-23T17:48:42.484Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T17:48:42.485Z DRIVER « [Node 036] [REQ] [ApplicationCommand]
                                  └─[BinarySwitchCCReport]
                                      current value: false
2022-08-23T17:49:19.968Z SERIAL « 0x010b0004004d0531050301008e                                        (13 bytes)
2022-08-23T17:49:19.970Z CNTRLR   [Node 077] [Multilevel Sensor] Illuminance: metadata updated      [Endpoint 0]
2022-08-23T17:49:19.972Z CNTRLR   [Node 077] [~] [Multilevel Sensor] Illuminance: 0 => 0            [Endpoint 0]
2022-08-23T17:49:19.973Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T17:49:19.974Z DRIVER « [Node 077] [REQ] [ApplicationCommand]
                                  └─[MultilevelSensorCCReport]
                                      type:  Illuminance
                                      scale: Percentage value
                                      value: 0
2022-08-23T17:49:57.253Z SERIAL « 0x010b00040022053105030103e2                                        (13 bytes)
2022-08-23T17:49:57.255Z CNTRLR   [Node 034] [Multilevel Sensor] Illuminance: metadata updated      [Endpoint 0]
2022-08-23T17:49:57.256Z CNTRLR   [Node 034] [~] [Multilevel Sensor] Illuminance: 5 => 3            [Endpoint 0]
2022-08-23T17:49:57.258Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T17:49:57.259Z DRIVER « [Node 034] [REQ] [ApplicationCommand]
                                  └─[MultilevelSensorCCReport]
                                      type:  Illuminance
                                      scale: Percentage value
                                      value: 3
2022-08-23T17:50:19.975Z SERIAL « 0x010b0004004d0531050301008e                                        (13 bytes)
2022-08-23T17:50:19.977Z CNTRLR   [Node 077] [Multilevel Sensor] Illuminance: metadata updated      [Endpoint 0]
2022-08-23T17:50:19.979Z CNTRLR   [Node 077] [~] [Multilevel Sensor] Illuminance: 0 => 0            [Endpoint 0]
2022-08-23T17:50:19.980Z SERIAL » [ACK]                                                                   (0x06)
2022-08-23T17:50:19.982Z DRIVER « [Node 077] [REQ] [ApplicationCommand]
                                  └─[MultilevelSensorCCReport]
                                      type:  Illuminance
                                      scale: Percentage value
                                      value: 0

You can turn devices on and off and watch the communication in the logs.
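For example, calling something like this from Developer Tools → Services should produce a matching SERIAL/DRIVER exchange in the log:

service: switch.toggle
target:
  entity_id: switch.outside_front_entry_light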

I have the impression as well that Z-Wave reliability changed after one of the recent updates.

I recently got a new Z-Wave update. The documentation was minimal. I tried it, and it doesn’t seem to have changed much in either direction.

The problem is becoming more troublesome, as I’m now getting comments from my wife. WAF is at risk :frowning:

Good luck - that’s a tough one :wink: