Z-Wave dead, then worked, then dead again

I think my Z-Stick is dying, but I’m hoping it’s just a configuration issue. I have two HA instances that I update at the same time, so they’re both on the same version and as close to each other in configuration as I can keep them. (One’s in our barn, one in the house - both on the same LAN, but out of wifi range of each other.)

One HA instance, the one in the barn, is fine. The other one, in the house, had problems last week - in short, Z-Wave was dead. I saw pending updates for HA Core and HAOS, so I installed them, restarted, and it was working. Then, tonight, I saw that Z-Wave wasn’t working in the house. (It worked this morning, though.) Again, there were HAOS and HA Core updates waiting, so I installed them and rebooted. Still no luck.

HA Version info:
[Screenshot 2024-06-24 at 2.47.15 AM]
Z-Wave JS info:
zwave-js-ui: 9.14.1
zwave-js: 12.11.1
home id: 4229488702
home hex: 0xfc18e03e

Here’s my device list for ZWJSUI. This includes the Z-Stick info, which is from Aeotec.

The log is at the bottom. It looks like the driver sees the Z-Stick, then can’t get it to respond, then it’s back, then it’s dead again. I know that, for some reason, this system has had trouble before with this Z-Stick getting a consistent USB device node so it can be found reliably. (I’m wondering if udev could be used to fix that.)
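On the device-node question: Linux systems running udev normally already keep stable symlinks under /dev/serial/by-id, so a changing /dev/ttyACM0 or /dev/ttyUSB0 name can often be sidestepped by pointing the Z-Wave software at the by-id path instead. The sketch below only lists those symlinks and the nodes they currently resolve to. The function name is made up for illustration, and it assumes you have shell access on the host (on HAOS that usually means an SSH add-on).

```python
import os

def list_stable_serial_paths(by_id_dir="/dev/serial/by-id"):
    """Return {stable symlink path: device node it currently points to}.

    Returns an empty dict when the directory does not exist, which is
    what happens when no USB serial adapter is plugged in.
    """
    if not os.path.isdir(by_id_dir):
        return {}
    result = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        # realpath follows the symlink to the kernel-assigned node
        result[link] = os.path.realpath(link)
    return result

if __name__ == "__main__":
    for link, node in list_stable_serial_paths().items():
        print(f"{link} -> {node}")
```

If the stick shows up there, the by-id path is the one to try as the serial port setting, since it should stay the same even if the tty number changes between boots.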

I have a Zooz Z-Stick, bought at the same time as the Aeotec one, still sitting in its box and ready to use if I need it. My concern is that, if I remember right, the Z-Wave network info is stored in the controller itself. If I swap to a new controller, I’d lose that and have to remove each device from the old network and add it to the new one. Then I’d have to update the info in HA, and I don’t know whether that means redoing the automations and scenes too, or whether I can just reuse the same names and have them apply to the devices on the new controller.

This is frustrating for me, but it’s infuriating for my wife. I’d really like to get this taken care of, and if it’s an issue with communications with this stick, I’d like to fix it once and for all, even if it means adding the Zooz stick and removing the Aeotec one. (If I do that, can I add the Zooz and move one or two devices over at a time, until they’re all switched over?)


Logs:

2024-06-24 02:50:09.945 INFO Z-WAVE: Controller status: Controller is Ready
2024-06-24T06:50:13.938Z DRIVER all queues busy
2024-06-24T06:50:13.948Z SERIAL » 0x0103003bc7 (5 bytes)
2024-06-24T06:50:13.949Z DRIVER » [REQ] [GetBackgroundRSSI]
2024-06-24T06:50:14.954Z CNTRLR Failed to execute controller command after 1/3 attempts. Scheduling next try in 100 ms.
2024-06-24T06:50:15.057Z DRIVER » [REQ] [GetBackgroundRSSI]
2024-06-24T06:50:15.059Z SERIAL » 0x0103003bc7 (5 bytes)
2024-06-24T06:50:16.066Z CNTRLR Failed to execute controller command after 2/3 attempts. Scheduling next try in 1100 ms.
2024-06-24T06:50:17.170Z DRIVER » [REQ] [GetBackgroundRSSI]
2024-06-24T06:50:17.172Z SERIAL » 0x0103003bc7 (5 bytes)
2024-06-24T06:50:18.178Z CNTRLR The controller is unresponsive
2024-06-24 02:50:18.180 INFO Z-WAVE: Controller status: Controller is unresponsive
2024-06-24T06:50:18.182Z DRIVER Attempting to recover unresponsive controller by restarting it…
2024-06-24T06:50:18.183Z CNTRLR Performing soft reset…
2024-06-24T06:50:18.192Z SERIAL » 0x01030008f4 (5 bytes)
2024-06-24T06:50:18.193Z DRIVER » [REQ] [SoftReset]
2024-06-24T06:50:19.199Z CNTRLR Failed to execute controller command after 1/3 attempts. Scheduling next try in 100 ms.
2024-06-24T06:50:19.301Z DRIVER » [REQ] [SoftReset]
2024-06-24T06:50:19.303Z SERIAL » 0x01030008f4 (5 bytes)
2024-06-24 02:50:19.790 INFO APP: GET /health/zwave 301 1.063 ms - 191
2024-06-24T06:50:20.310Z CNTRLR Failed to execute controller command after 2/3 attempts. Scheduling next try in 1100 ms.
2024-06-24T06:50:21.413Z DRIVER » [REQ] [SoftReset]
2024-06-24T06:50:21.414Z SERIAL » 0x01030008f4 (5 bytes)
2024-06-24T06:50:22.426Z CNTRLR Soft reset failed: Timeout while waiting for an ACK from the controller (ZW0200)
2024-06-24T06:50:22.428Z DRIVER Attempting to recover unresponsive controller by reopening the serial port…
2024-06-24T06:50:22.430Z DRIVER all queues idle
2024-06-24T06:50:23.448Z DRIVER Serial port reopened. Returning to normal operation and hoping for the best…
2024-06-24T06:50:23.448Z CNTRLR The controller is no longer unresponsive
2024-06-24 02:50:23.449 INFO Z-WAVE: Controller status: Controller is Ready
2024-06-24T06:50:43.937Z DRIVER all queues busy
2024-06-24T06:50:43.944Z SERIAL » 0x0103003bc7 (5 bytes)
2024-06-24T06:50:43.946Z DRIVER » [REQ] [GetBackgroundRSSI]
2024-06-24T06:50:44.953Z CNTRLR Failed to execute controller command after 1/3 attempts. Scheduling next try in 100 ms.
2024-06-24T06:50:45.055Z DRIVER » [REQ] [GetBackgroundRSSI]
2024-06-24T06:50:45.057Z SERIAL » 0x0103003bc7 (5 bytes)
2024-06-24T06:50:46.063Z CNTRLR Failed to execute controller command after 2/3 attempts. Scheduling next try in 1100 ms.
2024-06-24T06:50:47.166Z DRIVER » [REQ] [GetBackgroundRSSI]
2024-06-24T06:50:47.167Z SERIAL » 0x0103003bc7 (5 bytes)
2024-06-24T06:50:48.174Z CNTRLR The controller is unresponsive
2024-06-24 02:50:48.175 INFO Z-WAVE: Controller status: Controller is unresponsive
2024-06-24T06:50:48.177Z DRIVER Attempting to recover unresponsive controller by restarting it…
2024-06-24T06:50:48.178Z CNTRLR Performing soft reset…
2024-06-24T06:50:48.186Z SERIAL » 0x01030008f4 (5 bytes)
2024-06-24T06:50:48.187Z DRIVER » [REQ] [SoftReset]
2024-06-24T06:50:49.196Z CNTRLR Failed to execute controller command after 1/3 attempts. Scheduling next try in 100 ms.
2024-06-24T06:50:49.298Z DRIVER » [REQ] [SoftReset]
2024-06-24T06:50:49.300Z SERIAL » 0x01030008f4 (5 bytes)
2024-06-24 02:50:49.906 INFO APP: GET /health/zwave 301 1.102 ms - 191
2024-06-24T06:50:50.305Z CNTRLR Failed to execute controller command after 2/3 attempts. Scheduling next try in 1100 ms.
2024-06-24T06:50:51.408Z DRIVER » [REQ] [SoftReset]
2024-06-24T06:50:51.410Z SERIAL » 0x01030008f4 (5 bytes)
2024-06-24T06:50:52.417Z CNTRLR Soft reset failed: Timeout while waiting for an ACK from the controller (ZW0200)
2024-06-24T06:50:52.418Z DRIVER Attempting to recover unresponsive controller by reopening the serial port…
2024-06-24T06:50:52.421Z DRIVER all queues idle
2024-06-24T06:50:53.434Z DRIVER Serial port reopened. Returning to normal operation and hoping for the best…
2024-06-24T06:50:53.436Z CNTRLR The controller is no longer unresponsive
2024-06-24 02:50:53.437 INFO Z-WAVE: Controller status: Controller is Ready
2024-06-24T06:51:13.938Z DRIVER all queues busy
2024-06-24T06:51:13.944Z SERIAL » 0x0103003bc7 (5 bytes)
2024-06-24T06:51:13.945Z DRIVER » [REQ] [GetBackgroundRSSI]
2024-06-24T06:51:14.953Z CNTRLR Failed to execute controller command after 1/3 attempts. Scheduling next try in 100 ms.
2024-06-24T06:51:15.056Z DRIVER » [REQ] [GetBackgroundRSSI]
2024-06-24T06:51:15.057Z SERIAL » 0x0103003bc7 (5 bytes)
2024-06-24T06:51:16.063Z CNTRLR Failed to execute controller command after 2/3 attempts. Scheduling next try in 1100 ms.
2024-06-24T06:51:17.169Z DRIVER » [REQ] [GetBackgroundRSSI]
2024-06-24T06:51:17.170Z SERIAL » 0x0103003bc7 (5 bytes)
2024-06-24T06:51:18.179Z CNTRLR The controller is unresponsive
2024-06-24 02:51:18.181 INFO Z-WAVE: Controller status: Controller is unresponsive
2024-06-24T06:51:18.182Z DRIVER Attempting to recover unresponsive controller by restarting it…
2024-06-24T06:51:18.183Z CNTRLR Performing soft reset…
2024-06-24T06:51:18.190Z SERIAL » 0x01030008f4 (5 bytes)
2024-06-24T06:51:18.190Z DRIVER » [REQ] [SoftReset]
2024-06-24T06:51:19.195Z CNTRLR Failed to execute controller command after 1/3 attempts. Scheduling next try in 100 ms.
2024-06-24T06:51:19.297Z DRIVER » [REQ] [SoftReset]
2024-06-24T06:51:19.298Z SERIAL » 0x01030008f4 (5 bytes)
2024-06-24 02:51:20.003 INFO APP: GET /health/zwave 301 1.158 ms - 191
2024-06-24T06:51:20.304Z CNTRLR Failed to execute controller command after 2/3 attempts. Scheduling next try in 1100 ms.
2024-06-24T06:51:21.408Z DRIVER » [REQ] [SoftReset]
2024-06-24T06:51:21.409Z SERIAL » 0x01030008f4 (5 bytes)
2024-06-24T06:51:22.418Z CNTRLR Soft reset failed: Timeout while waiting for an ACK from the controller (ZW0200)
2024-06-24T06:51:22.420Z DRIVER Attempting to recover unresponsive controller by reopening the serial port…
2024-06-24T06:51:22.423Z DRIVER all queues idle
2024-06-24T06:51:23.439Z DRIVER Serial port reopened. Returning to normal operation and hoping for the best…
2024-06-24T06:51:23.440Z CNTRLR The controller is no longer unresponsive
2024-06-24 02:51:23.441 INFO Z-WAVE: Controller status: Controller is Ready
2024-06-24 02:51:34.756 DEBUG SOCKET: User disconnected from WxHpxZbHSuLyiraRAAAH: transport close
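For anyone reading a log like the one above, here is a rough Python sketch of two checks that can be done by hand or by script. The first validates the checksum byte of an outgoing serial frame: as I understand the Z-Wave Serial API framing, it is the XOR of every byte after the leading 0x01, seeded with 0xFF. The second pulls out the driver's "The controller is unresponsive" timestamps so the spacing between failures is visible. The function names and the SAMPLE list are mine; the lines in SAMPLE are copied from the log.

```python
import re
from datetime import datetime

# Driver lines look like: 2024-06-24T06:50:18.178Z CNTRLR <message>
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z\s+(\w+)\s+(.*)$")

SAMPLE = [
    "2024-06-24T06:50:18.178Z CNTRLR The controller is unresponsive",
    "2024-06-24 02:50:18.180 INFO Z-WAVE: Controller status: Controller is unresponsive",
    "2024-06-24T06:50:18.192Z SERIAL » 0x01030008f4 (5 bytes)",
    "2024-06-24T06:50:48.174Z CNTRLR The controller is unresponsive",
    "2024-06-24T06:51:18.179Z CNTRLR The controller is unresponsive",
]

def frame_checksum_ok(frame_hex):
    """True if the last byte equals 0xFF XORed with every byte after the SOF."""
    text = frame_hex[2:] if frame_hex.startswith("0x") else frame_hex
    data = bytes.fromhex(text)
    if len(data) < 3 or data[0] != 0x01:  # 0x01 marks the start of a data frame
        return False
    expected = 0xFF
    for b in data[1:-1]:
        expected ^= b
    return expected == data[-1]

def unresponsive_times(lines):
    """Timestamps of the driver's CNTRLR 'unresponsive' messages, in log order."""
    times = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(2) == "CNTRLR" and m.group(3).strip() == "The controller is unresponsive":
            times.append(datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f"))
    return times

if __name__ == "__main__":
    times = unresponsive_times(SAMPLE)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    print("unresponsive events:", len(times))
    print("seconds between events:", gaps)
    print("SoftReset frame checksum ok:", frame_checksum_ok("0x01030008f4"))
```

Both frames in the log (0x0103003bc7 and 0x01030008f4) pass that checksum, and the failures land about 30 seconds apart, each one starting with a GetBackgroundRSSI request. That would fit the driver sending well-formed requests and the stick simply never sending an ACK, rather than garbled traffic on the port.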

Did you power cycle the box and remove the USB stick?

Power cycled - but didn’t think to shut off, remove the stick, put it back in, and try it.

I know I had a similar issue before - I forgot how I fixed it, but I’m wondering if the stick is going bad, so I’m looking into replacing it. It’s just that it’d be a real pain!

It was later in the evening before I could get to my Pi to shut it down, pull the Z-Stick, then put it back in and turn it back on. It worked.

That leaves me with the concern or question of just what is going on. This happens on this HA instance on this Pi, but not on the other. I’m wondering just what could be going on that could make it go down like this and still come up and work when I do this. I’d like to find a way to prevent it.

This is a fairly common problem when doing HA updates. Hardware / firmware sometimes gets in a bad state and needs a power cycle. Having a spare stick and viable NVM backups is a good practice.

Here’s what I do.

a) update once a quarter to the last release of the month (.3 or .4), use this as an opportunity to power cycle the computer.
b) only do this when I have time and am around for the next couple of days
c) do one system first, if it’s stable after a week do the second
d) take NVM backups before and after adding new devices
e) periodically (2x a year), restore the NVM backup to spare stick and run on the spare stick.
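A small helper along these lines could back up points (d) and (e) by reporting how old the newest NVM backup is before an update. This is only a sketch: the function name is made up, and both the directory you pass in and the ".bin" suffix are assumptions to adjust to wherever your backups actually end up.

```python
import os
import time

def newest_backup(directory, suffix=".bin"):
    """Return (path, age_in_days) of the most recently modified matching file.

    Returns None if the directory holds no file with the given suffix.
    """
    candidates = [
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(suffix)
    ]
    if not candidates:
        return None
    newest = max(candidates, key=os.path.getmtime)
    age_days = (time.time() - os.path.getmtime(newest)) / 86400
    return newest, age_days
```

Calling it on the backup folder before an update and refusing to continue when the result is None, or older than the last time devices were added, would make the routine harder to skip.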

You make some good points. I get really tired of seeing updates, as if it’s just demanding my attention, every week. I’ve often felt like HA should offer a “Stable” option, maybe like Debian, where their stable version has been so heavily tested. Or something like Ubuntu with the long term support on some versions. Dealing with so many updates is, in many ways, contrary to the idea of home automation. Home automation is so we don’t have to deal with some things - we can just leave 'em automated. To do it on HA, though, it’d have to be different.

I’ve thought of suggesting this, but I seriously don’t think the devs would like the idea, since it’d mean more work. Still, since HA is now going to be selling boxes with HA on them, it would be a good idea for that. They’d have to focus on a cycle, like 3, 4, or maybe 6 months, and take an update (of the core, HAOS, and other major add-ons and integrations) that has been out for a while and has proven to have a minimum of problems, then push that out as a “stable update” at the end of the stable update cycle. That way people who subscribe to stable updates would only get updates once in a while, and those updates would be well tested first.

I think I’m going to start doing what you’re doing - only do updates once in a while, and use it as a chance to power-cycle the system. (I know on a Pi, that doing a reboot doesn’t do a full power cycle, so I think there’s a good chance just power-cycling it might have fixed my issue.)

The odd thing is the barn system never seems to have the issues the house system does!