ZWave 1.4 -> ZWaveJS migration -- did my Aeotec Gen5 die?

This week I decided to take the plunge and migrate from ZWave (deprecated) to ZWave-JS. It didn’t go well…

My installation is an RPi3B+ with an Aeotec Gen5 USB stick ZW090-A. It’s a largish network, about 125 ZWave devices of various ages and manufacturers. But it’s been working well for several years, as I’ve updated frequently to keep up with the software releases of HA. I haven’t done anything to the Gen5 stick firmware though; it’s just as it came from the factory.

Using the guidance on the HA website, I tried the “Start Migration” process. It ran for a while and then said “Migration failed.” It hadn’t gotten even to the point of telling me what it would be able to migrate. No other info about why it failed.

So I figured I should try the manual method. I powered down and removed the SD chip so that I could go back to the old system that had been working for years. I used a new SD chip (SanDisk 32GB “high endurance”), and flashed it using BalenaEtcher with the current HA release - https://github.com/home-assistant/operating-system/releases/download/7.4/haos_rpi3-64-7.4.img.xz. Brought that up and did a restore-from-backup I had taken a few days earlier. Everything seemed to come back online fine, including other integrations (Sonos, Vizio, etc.). Deleted the ZWave integration, power-cycled, rebooted, and installed the ZWave-JS addon, using the network key I had saved from the old install. Looked at the log and it seemed to be methodically working through my ZWave devices. Hours later (125+ devices to interview…) the system showed it had found 125 devices, 124 of which were not ready. Installed the ZWave-JS integration. It showed lots of devices, some of which had the expected manufacturer and device model information. But nothing was “ready”.

Going to Plan B, I powered the RPi down and put the original SD chip back into service. Booted up, but after a few hours all ZWave devices were still in “Unknown” state. In the past, all but battery-powered devices would be fully functional in that time. So now the “old” system no longer works either. Tried again, using older backups, with the same result.

Going to Plan C, I decided to try a totally fresh install. Reflashed an SD chip to the same current HA release, booted it up, and did a new setup rather than a restore from backup. I’d have to set up all my other integrations again of course, but this seemed likely, as a fresh install, to get ZWave-JS running.

Plan C didn’t work either. I captured parts of the ZWave-JS add-on log as it went through the startup process, and watched the LEDs on the Gen5 stick as it worked. Everything looked good at first, but after somewhere around 100 devices the log started showing errors (e.g., “Status No Ack (ZW0204)”) and the Gen5 stick seemed to become “stuck”. Excerpts from the log are below if anyone’s curious.

At this point, I’m suspecting that my USB stick died – while it was in the process of migrating. Not sure if that might be just coincidence or because of something that ZWave-JS does that the old ZWave integration didn’t trigger. Perhaps ZWave-JS is going through the interview process much faster than the ZWave 1.4 software and somehow that overloads something and starts getting errors? Curious though that even the old deprecated 1.4 configuration no longer works. Conclusion – the Gen5 stick has died somehow.

So, I’ve ordered a new USB stick (sticking to 500 series with a HUSBZB-1). Wanted to add Zigbee capability anyway. Hopefully I can get my setup working again after getting each ZWave device onto the new stick. It will take a while; some of the devices aren’t easy to reach.

Meanwhile, maybe someone sees some clues in these logs. The symptoms (large network, lots of devices, a dozen or so battery-powered devices, loss of ability to communicate, etc.) sound to me to be very similar to the problem recently reported with the new 700 devices. But my controller is a 500 device. Maybe there was a similar bug in the 500 system too? Or perhaps there’s something from those dead devices in the Gen5’s network data that is somehow causing the behavior I see? There were a bunch of devices that I remember took several tries to Include and left “dead” devices that I never managed to remove, but they didn’t seem to be a problem. Or perhaps my Gen5 stick just died; it’s been running continuously for 6 or 7 years.

Jack Haverty

============ excerpts from ZWave-JS addon logs, with a few comments I added =============
ZWave-JS addon started up, looks like it’s progressing normally. ZStick Gen5 LEDs are cycling yellow/red/blue every few seconds as expected.

2022-02-12T18:21:26.880Z CNTRLR [Node 068] The node is alive.
2022-02-12T18:21:26.889Z CNTRLR « [Node 068] ping successful
2022-02-12T18:21:26.891Z CNTRLR [Node 068] Interviewing Manufacturer Specific…
2022-02-12T18:21:26.892Z CNTRLR » [Node 068] querying manufacturer information…
2022-02-12T18:21:26.980Z CNTRLR [Node 069] The node is alive.
2022-02-12T18:21:27.002Z CNTRLR « [Node 069] ping successful
2022-02-12T18:21:27.004Z CNTRLR [Node 069] Interviewing Manufacturer Specific…
2022-02-12T18:21:27.005Z CNTRLR » [Node 069] querying manufacturer information…
2022-02-12T18:21:27.086Z CNTRLR [Node 070] The node is alive.
2022-02-12T18:21:27.096Z CNTRLR « [Node 070] ping successful
2022-02-12T18:21:27.097Z CNTRLR [Node 070] Interviewing Manufacturer Specific…
2022-02-12T18:21:27.098Z CNTRLR » [Node 070] querying manufacturer information…
2022-02-12T18:21:27.154Z CNTRLR [Node 071] The node is alive.
2022-02-12T18:21:27.163Z CNTRLR « [Node 071] ping successful
2022-02-12T18:21:27.164Z CNTRLR [Node 071] Interviewing Manufacturer Specific…
2022-02-12T18:21:27.165Z CNTRLR » [Node 071] querying manufacturer information…
2022-02-12T18:21:27.256Z CNTRLR [Node 072] The node is alive.
2022-02-12T18:21:27.282Z CNTRLR « [Node 072] ping successful
2022-02-12T18:21:27.284Z CNTRLR [Node 072] Interviewing Manufacturer Specific…
2022-02-12T18:21:27.285Z CNTRLR » [Node 072] querying manufacturer information…
2022-02-12T18:21:27.357Z CNTRLR [Node 073] The node is alive.
2022-02-12T18:21:27.368Z CNTRLR « [Node 073] ping successful
2022-02-12T18:21:27.370Z CNTRLR [Node 073] Interviewing Manufacturer Specific…
2022-02-12T18:21:27.371Z CNTRLR » [Node 073] querying manufacturer information…
2022-02-12T18:21:27.468Z CNTRLR [Node 078] The node is alive.
2022-02-12T18:21:27.478Z CNTRLR « [Node 078] ping successful
2022-02-12T18:21:27.480Z CNTRLR [Node 078] Interviewing Manufacturer Specific…
2022-02-12T18:21:27.481Z CNTRLR » [Node 078] querying manufacturer information…
2022-02-12T18:21:27.552Z CNTRLR [Node 088] The node is alive.
2022-02-12T18:21:27.562Z CNTRLR « [Node 088] ping successful
2022-02-12T18:21:27.563Z CNTRLR [Node 088] Interviewing Manufacturer Specific…
2022-02-12T18:21:27.564Z CNTRLR » [Node 088] querying manufacturer information…
2022-02-12T18:21:27.612Z CNTRLR [Node 098] The node is alive.
2022-02-12T18:21:27.661Z CNTRLR « [Node 098] ping successful
2022-02-12T18:21:27.667Z CNTRLR » [Node 098] Querying securely supported commands (S0)…
2022-02-12T18:21:27.764Z CNTRLR [Node 103] The node is alive.
2022-02-12T18:21:27.775Z CNTRLR « [Node 103] ping successful
2022-02-12T18:21:27.777Z CNTRLR » [Node 103] Querying securely supported commands (S0)…

.
.
.

ZWave JS startup continues, but after a minute or two starts getting errors. Stick LEDs are still cycling but very slowly.

2022-02-12T18:22:20.881Z CNTRLR [Node 122] The node did not respond after 1 attempts, it is presumed dead
2022-02-12T18:22:20.884Z CNTRLR [Node 122] The node is dead.
2022-02-12T18:22:20.907Z CNTRLR [Node 122] ping failed: Failed to send the command after 1 attempts (Status No
Ack) (ZW0204)
2022-02-12T18:22:20.908Z CNTRLR » [Node 122] querying node info…
2022-02-12T18:22:20.912Z CNTRLR » [Node 122] pinging the node…

.
.
.

ZWaveJS interview continues, but everything fails. Stick LEDs are stuck in a single color, not cycling at all.

2022-02-12T18:23:07.577Z CNTRLR [Node 135] The node did not respond after 1 attempts, it is presumed dead
2022-02-12T18:23:07.580Z CNTRLR [Node 135] The node is dead.
2022-02-12T18:23:07.601Z CNTRLR [Node 135] ping failed: Failed to send the command after 1 attempts (Status No
Ack) (ZW0204)
2022-02-12T18:23:07.602Z CNTRLR » [Node 135] querying node info…
2022-02-12T18:23:07.603Z CNTRLR » [Node 135] pinging the node…
2022-02-12T18:23:11.806Z CNTRLR [Node 139] The node did not respond after 1 attempts, it is presumed dead
2022-02-12T18:23:11.809Z CNTRLR [Node 139] The node is dead.
2022-02-12T18:23:11.835Z CNTRLR [Node 139] ping failed: Failed to send the command after 1 attempts (Status No
Ack) (ZW0204)
2022-02-12T18:23:11.836Z CNTRLR » [Node 139] querying node info…
2022-02-12T18:23:11.838Z CNTRLR » [Node 139] pinging the node…
2022-02-12T18:23:11.972Z CNTRLR [Node 145] The node is alive.
2022-02-12T18:23:11.985Z CNTRLR « [Node 145] ping successful
2022-02-12T18:23:11.986Z CNTRLR [Node 145] Interviewing Manufacturer Specific…
2022-02-12T18:23:11.987Z CNTRLR » [Node 145] querying manufacturer information…
2022-02-12T18:23:13.624Z CNTRLR No response from controller after 1/3 attempts. Scheduling next try in 100 ms.
2022-02-12T18:23:14.013Z CNTRLR Failed to execute controller command after 2/3 attempts. Scheduling next try i
n 1100 ms.
2022-02-12T18:23:15.563Z CNTRLR [Node 146] The node is alive.
2022-02-12T18:23:15.578Z CNTRLR « [Node 146] ping successful
2022-02-12T18:23:15.579Z CNTRLR » [Node 146] querying node info…
2022-02-12T18:23:15.640Z CNTRLR [Node 147] The node is alive.
2022-02-12T18:23:15.653Z CNTRLR « [Node 147] ping successful
2022-02-12T18:23:15.654Z CNTRLR [Node 147] Interviewing Manufacturer Specific…
2022-02-12T18:23:15.655Z CNTRLR » [Node 147] querying manufacturer information…
2022-02-12T18:23:19.865Z CNTRLR [Node 148] The node did not respond after 1 attempts, it is presumed dead
2022-02-12T18:23:19.867Z CNTRLR [Node 148] The node is dead.
2022-02-12T18:23:19.885Z CNTRLR [Node 148] ping failed: Failed to send the command after 1 attempts (Status No
Ack) (ZW0204)
2022-02-12T18:23:19.886Z CNTRLR » [Node 148] querying node info…
2022-02-12T18:23:19.887Z CNTRLR » [Node 148] pinging the node…

ZWave stick LEDs are now off, no color shown at all.
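
For anyone who wants to skim a long log like the one above rather than read it line by line, here is a rough Python sketch that tallies the last reported state of each node and counts the controller retry lines. It assumes the log lines look like the excerpts in this post ("[Node NNN] The node is alive." / "The node is dead." and "... after 1/3 attempts ..."); the function name `summarize` is made up for illustration and is not part of ZWave-JS.

```python
import re

# Node status lines as they appear in the excerpts above, e.g.
# "... CNTRLR [Node 122] The node is dead."
NODE_STATUS = re.compile(r"\[Node (\d+)\] The node is (alive|dead)\.")

# Controller-level retry lines, e.g.
# "No response from controller after 1/3 attempts. ..."
# Per-node lines such as "did not respond after 1 attempts" have no
# "N/M" counter, so they are not matched here.
CONTROLLER_RETRY = re.compile(r"after \d+/\d+ attempts")


def summarize(log_text):
    """Return (last seen status per node id, number of controller retry lines)."""
    status = {}
    retries = 0
    for line in log_text.splitlines():
        match = NODE_STATUS.search(line)
        if match:
            # Later lines overwrite earlier ones, so this is the latest state.
            status[int(match.group(1))] = match.group(2)
        elif CONTROLLER_RETRY.search(line):
            retries += 1
    return status, retries
```

Calling `summarize(open("zwavejs.log").read())` on a saved copy of the add-on log (the file name is only an example) gives a dictionary of node id to "alive" or "dead", which makes it easier to see whether the dead nodes cluster after the point where the controller retries start.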

Have you tried removing and reinstalling each device individually, i.e., exclusion then inclusion?

No, I plan to do that with a new stick when it arrives, since it seems that HA now has trouble interacting with my old stick. Clicking “Add Device” in Configuration/Devices brings up the window with the spinning circle, but that’s as far as it gets. So I suspect the stick itself is no longer working. Hopefully it will work with a new stick…

I was reading here that you can install zwavejs2mqtt and reset. It’s in the 6th post down.

Hi.

I’ve experienced a similar problem. I think it was due to a faulty device on the network, as the Z-wave network has been stable for the last few days, but I’m a bit hesitant to reboot my Pi 4, as I’m waiting on a new 700 series controller in case the problem is indeed my Aeotec Gen5 USB stick.

I wonder if the new z-wave integration handles heavy traffic or retries differently, but I’m no dev.

After a reboot (not just HA, but “reboot host” via Supervisor) previously the z-wave network would report almost all my devices as “dead”. I only have 46 devices, but 30ish would be dead after a reboot, and it would take hours until it settled down to around 10 dead devices. Sometimes healing would work to get the last devices online, but there seems to be a bug that keeps healing from finishing, so even days after initiating a heal it would say that “another heal is in progress” if I tried healing just one device that was still “dead”.

What would work though is “zwave_js.ping”, available in developer tools under services. A dead node would most times come back alive after it was pinged. I’ve seen scripts to automate this on this forum.
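
In case it helps, here is a minimal Python sketch of how the ping could be triggered from outside HA through the REST API instead of clicking in developer tools. It assumes the “zwave_js.ping” service is available in your HA version and accepts an `entity_id` list, and that you have created a long-lived access token. The base URL, the token, and the entity id in the usage note are placeholders, and the helper names are made up for this example.

```python
import json
import urllib.request


def build_ping_request(base_url, token, entity_ids):
    """Build (but do not send) a POST that calls the zwave_js.ping service."""
    url = base_url.rstrip("/") + "/api/services/zwave_js/ping"
    # Service data: one entity per node you want to ping (assumed schema).
    body = json.dumps({"entity_id": list(entity_ids)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def ping_nodes(base_url, token, entity_ids, timeout=30):
    """Send the ping request and return the HTTP status code."""
    request = build_ping_request(base_url, token, entity_ids)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Usage would be something like `ping_nodes("http://homeassistant.local:8123", "YOUR_TOKEN", ["switch.example_node"])`, run from cron or a loop over the entities of the nodes that show as dead. An automation inside HA would do the same job; this is just one way to script it.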

After removing one z-wave device that had given me problems previously and that no longer woke up via ping, I’ve been able to get all my devices back online, so right now it seems that device was the problem.

If you are using the “terminal” plug-in or similar, run “dmesg | grep -i usb” to see if there are dis/reconnects of USB devices on the host.
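
If the dmesg output is long, a small Python sketch like the following can pull out just the connect and disconnect events. It assumes the kernel prints lines of the usual form "usb 1-1.3: USB disconnect, device number 4" and "usb 1-1.3: new full-speed USB device number 5 using ..."; the exact wording can differ between kernel versions, and the function name `usb_events` is made up for this example.

```python
import re

# Typical Linux kernel messages for a USB device going away and coming back.
# The wording is an assumption based on common kernel output.
DISCONNECT = re.compile(r"usb ([\d.\-]+): USB disconnect, device number (\d+)")
CONNECT = re.compile(r"usb ([\d.\-]+): new .*?USB device number (\d+)")


def usb_events(dmesg_text):
    """Return a list of (event, port, device_number) tuples in log order."""
    events = []
    for line in dmesg_text.splitlines():
        match = DISCONNECT.search(line)
        if match:
            events.append(("disconnect", match.group(1), int(match.group(2))))
            continue
        match = CONNECT.search(line)
        if match:
            events.append(("connect", match.group(1), int(match.group(2))))
    return events
```

Save the output with “dmesg > dmesg.txt” and pass the file contents to `usb_events`. Repeated disconnect/connect pairs on the port where the stick is plugged in would point at the USB side (cable, power, or the stick itself) rather than at the Z-wave network.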

Here are some logs from when I had problems:

2021-12-31T11:33:11.915Z CNTRLR   No response from controller after 1/3 attempts. Scheduling next try in 100 ms.
2021-12-31T11:33:12.218Z CNTRLR   Failed to execute controller command after 2/3 attempts. Scheduling next try i
                                  n 1100 ms.
2021-12-31T11:33:13.631Z CNTRLR   The controller response indicated failure after 1/3 attempts. Scheduling next 
                                  try in 100 ms.
2021-12-31T11:33:13.755Z CNTRLR   The controller response indicated failure after 2/3 attempts. Scheduling next 
                                  try in 1100 ms.
2021-12-31T11:33:23.555Z CNTRLR   Failed to execute controller command after 1/3 attempts. Scheduling next try i
                                  n 100 ms.
2021-12-31T11:33:57.880Z CNTRLR   No response from controller after 1/3 attempts. Scheduling next try in 100 ms.
2021-12-31T11:33:57.998Z CNTRLR   The controller response indicated failure after 2/3 attempts. Scheduling next 
                                  try in 1100 ms.
2021-12-31T11:34:19.697Z DRIVER   Dropping message with invalid payload:
                                  0x01060004004300be
2021-12-31T11:35:57.254Z CNTRLR « [Node 069] refreshing neighbor list failed...
2021-12-31T11:37:14.571Z DRIVER   Dropping message because the driver is not ready to handle it yet.
2021-12-31T11:37:14.582Z DRIVER   Dropping message because the driver is not ready to handle it yet.
2021-12-31T11:37:14.594Z DRIVER   Dropping message because the driver is not ready to handle it yet.
2021-12-31T11:37:14.627Z DRIVER   Dropping message because the driver is not ready to handle it yet.
2021-12-31T11:37:14.644Z DRIVER   Dropping message because the driver is not ready to handle it yet.
2021-12-31T11:37:14.657Z DRIVER   Dropping message because the driver is not ready to handle it yet.
2021-12-31T11:37:14.738Z DRIVER   Dropping message because the driver is not ready to handle it yet.
2021-12-31T11:37:14.753Z DRIVER   Dropping message because the driver is not ready to handle it yet.

Thanks for the info. My logs look pretty much the same. The result is that nothing works and I’m waiting for a new controller to arrive.

Meanwhile, I also tried removing all ZWave-JS stuff again, and going through the process to install zwavejs2mqtt instead. That process went well, but the logs show a similar pattern of errors as the interview process goes on. The Gen5 device shows up, but nothing else gets through the interview stages (but it’s only been an hour so far).

Since my “old” ZWave 1.4 setup, i.e., using the old SD card that I removed when I started the migration, no longer works either, I suspect there’s something wrong with the Aeotec Gen5. So I’m just waiting for the new hardware to arrive.

Tnx,
Jack

More info. I’ve been continuing the zwavejs2mqtt configuration and some devices have actually started to appear. There are 124 devices, all in some stage of interview after several hours. Many are at “Node Info” stage; some still at “Protocol Info”. Some info, e.g., device types, manufacturers, etc., has started to appear in the zwavejs2mqtt control panel. Nothing is “Ready” yet.

Looking at the logs, some communication is happening through the Gen5 stick, but there are lots of errors as reported earlier. I’ve discovered that the visual behavior of the Gen5 stick changes as the interview process goes on. The normal pattern of the LEDs is cycling blue/yellow/red every few seconds. During the interview process, the LEDs change behavior. Sometimes they get stuck in one color for a while; sometimes they flash very rapidly.

Eventually the LEDs become stuck with no color at all. At that point the HA logs show just continuous errors.

I’ve also discovered that if I unplug the Gen5 stick and plug it back in, behavior goes back to the “cycling” pattern, and the HA logs show some successful communications happening again. It’s possible that the system will eventually finish the interview process and everything come online (create entities etc.), if I just keep unplugging/plugging the Gen5 stick every time it gets stuck.

It’s not clear to me if this means the Gen5 stick is somehow failing, or if something that the zwave-js code does is causing the Gen5 stick to crash. The symptoms and configuration (lots of devices) seem similar to the posts I’ve seen about problems with the newer 700 ZWave interfaces. But I think the Gen5 is a 500 device. Perhaps the interview process with so many devices somehow overloads the Gen5.

FYI, the stick is on a long USB extension cable to avoid interference (which wasn’t a problem even when plugged directly into the RPi).

My next step is to try the new stick - different manufacturer - when it arrives.

I recommend you make a backup of the stick while it still seems like it is alive before you switch to the new one.

it will make it way easier than needing to re-pair all of those devices again.

Thanks, I’ll try that again. Before I started migration, I tried to use the backup tool for Gen5, but it doesn’t seem to work under Wine. So I’ll have to power up my old Windows system. That always takes a while as all the patches et al have to get installed before I can use it. In any case, I haven’t changed anything on the Gen5 at all (unless HA did something), so I’ll still have the old stick available with its network data if it proves useful.

I’m pretty much resigned to having to re-pair everything with a new controller. Had to do that a few years ago when I moved from a Vera controller to a new Vera controller, and then to the HA/RPi/Gen5 setup. I discovered that an Aeotec MiniMote is very helpful in that process. If the Mini is not configured to be on any ZWave network, it can be used to force devices to forget their current network data without needing to go through any exclusion using the old controller. Makes the process of moving devices a lot easier.

Today, my “broken” HA system seems to have deteriorated further. It’s still running and can display the default “Overview” page with all of my non-ZWave devices. But the ZWave-JS and Add-ons pages just come up with a blank frame. Even SSH no longer works.

When I get the new controller, I plan to start from scratch with a fresh install on a reflashed SD chip with the latest release, and see how that goes.

Thanks for the help!
Jack

Computers are such fun…