Raspberry Pi 4 and Aeon Labs Zstick Gen 5: Dead Nodes start appearing after HA restart

LeapFrog · July 26, 2019, 9:58pm

I am posting this here in case anyone has a similar issue and/or it helps anyone else.

System configuration:
2GB Raspberry Pi 4B
OEM Power Supply
Buster 10.0 (7/10/19)
Docker CE nightly 0.0.0-20190723010156-67a3bd4
Home Assistant 0.96.4 on Docker
Aeon Labs Gen5 USB Zstick (no network key or secure nodes)
Generic USB hub (to work around the ZStick not being detected properly)

Problem Description:
An initial installation of HA on an RPi 4B with a ~25-node Z-Wave network starts and works normally. There are no dead nodes, and all nodes respond and operate. Over a two-day period there are no dead or inoperable nodes and no notable errors in OZW.log.

Diagnostic Info:
In this working state, OZW.log shows the following exchange with Node031 for SwitchAllCmd_Get:

2019-07-26 13:46:14.960 Info, Node031, Sending (Query) message (Callback ID=0xa4, Expected Reply=0x04) - SwitchAllCmd_Get (Node=31): 0x01, 0x09, 0x00, 0x13, 0x1f, 0x02, 0x27, 0x02, 0x25, 0xa4, 0x5c
2019-07-26 13:46:14.968 Detail, Node031,   Received: 0x01, 0x04, 0x01, 0x13, 0x01, 0xe8
2019-07-26 13:46:14.968 Detail, Node031,   ZW_SEND_DATA delivered to Z-Wave stack
2019-07-26 13:46:15.087 Detail, Node031,   Received: 0x01, 0x07, 0x00, 0x13, 0xa4, 0x00, 0x00, 0x0d, 0x42
2019-07-26 13:46:15.087 Detail, Node031,   ZW_SEND_DATA Request with callback ID 0xa4 received (expected 0xa4)
2019-07-26 13:46:15.087 Info, Node031, Request RTT 127 Average Request RTT 143
2019-07-26 13:46:15.087 Detail,   Expected callbackId was received
2019-07-26 13:46:15.134 Detail, Node031,   Received: 0x01, 0x09, 0x00, 0x04, 0x00, 0x1f, 0x03, 0x27, 0x03, 0xff, 0x35
2019-07-26 13:46:15.135 Detail, 
2019-07-26 13:46:15.135 Info, Node031, Response RTT 175 Average Response RTT 204
2019-07-26 13:46:15.135 Detail, Node031, Initial read of value
2019-07-26 13:46:15.135 Info, Node031, Received SwitchAll report from node 31: On and Off Enabled
2019-07-26 13:46:15.135 Detail, Node031,   Expected reply and command class was received
2019-07-26 13:46:15.135 Detail, Node031,   Message transaction complete
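The hex sequences in these log lines are Z-Wave Serial API frames exchanged between OpenZWave and the stick. Below is a minimal sketch for reading them. It assumes the classic Serial API framing: SOF 0x01, a length byte, a type byte (0x00 request, 0x01 response), a function ID, the payload, and a checksum equal to 0xFF XORed with every byte from the length through the last payload byte. The checksums of the frames logged above are consistent with this layout. The helper names (`parse_frame`, `describe_send_data`) are hypothetical and not part of OpenZWave.

```python
from dataclasses import dataclass
from typing import Iterable

SOF = 0x01                # start-of-frame marker for data frames
FUNC_ZW_SEND_DATA = 0x13  # function ID that OZW.log labels ZW_SEND_DATA


@dataclass
class Frame:
    frame_type: int  # 0x00 = request, 0x01 = response
    function: int    # Serial API function ID
    payload: bytes   # bytes between the function ID and the checksum


def checksum(body: bytes) -> int:
    """0xFF XORed with every byte from LENGTH through the last payload byte."""
    value = 0xFF
    for b in body:
        value ^= b
    return value


def parse_frame(raw: Iterable[int]) -> Frame:
    """Validate and split one frame as printed in OZW.log (assumed layout)."""
    data = bytes(raw)
    if len(data) < 5 or data[0] != SOF:
        raise ValueError("not a Serial API data frame")
    # The length byte counts everything after itself, checksum included.
    if data[1] != len(data) - 2:
        raise ValueError("length byte does not match frame size")
    if checksum(data[1:-1]) != data[-1]:
        raise ValueError("checksum mismatch")
    return Frame(frame_type=data[2], function=data[3], payload=data[4:-1])


def describe_send_data(frame: Frame) -> dict:
    """Split an outgoing ZW_SEND_DATA request into its fields."""
    if frame.frame_type != 0x00 or frame.function != FUNC_ZW_SEND_DATA:
        raise ValueError("not a ZW_SEND_DATA request")
    p = frame.payload
    n = p[1]  # length of the command-class payload that follows
    return {
        "node": p[0],
        "command": p[2:2 + n].hex(),  # command class + command, as hex
        "tx_options": p[2 + n],
        "callback_id": p[3 + n],
    }


# The SwitchAllCmd_Get request copied from the log above.
get = parse_frame([0x01, 0x09, 0x00, 0x13, 0x1f, 0x02, 0x27, 0x02,
                   0x25, 0xa4, 0x5c])
print(describe_send_data(get))
```

For that frame the sketch reports node 31, command bytes `2702` (command class 0x27, command 0x02), transmit options 0x25 and callback ID 0xa4. This agrees with the `Node=31` and `Callback ID=0xa4` that OpenZWave printed.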

Issue:
After a restart of the RPi 4B (with no configuration changes), one or two dead nodes appear and OZW.log reports:

2019-07-26 14:54:06.607 Info, Node031, Sending (Query) message (Callback ID=0xa5, Expected Reply=0x04) - SwitchAllCmd_Get (Node=31): 0x01, 0x09, 0x00, 0x13, 0x1f, 0x02, 0x27, 0x02, 0x25, 0xa5, 0x5d
2019-07-26 14:54:06.731 Detail, Node031,   Received: 0x01, 0x04, 0x01, 0x13, 0x01, 0xe8
2019-07-26 14:54:06.731 Detail, Node031,   ZW_SEND_DATA delivered to Z-Wave stack
2019-07-26 14:54:12.241 Detail, Node031,   Received: 0x01, 0x07, 0x00, 0x13, 0xa5, 0x01, 0x02, 0x34, 0x79
2019-07-26 14:54:12.241 Detail, Node031,   ZW_SEND_DATA Request with callback ID 0xa5 received (expected 0xa5)
2019-07-26 14:54:12.242 Info, Node031, WARNING: ZW_SEND_DATA failed. No ACK received - device may be asleep.
2019-07-26 14:54:12.242 Warning, Node031, WARNING: Device is not a sleeping node.
2019-07-26 14:54:12.242 Detail, Node001,   Expected callbackId was received
2019-07-26 14:54:14.606 Error, Node031, ERROR: Dropping command, expected response not received after 1 attempt(s)
2019-07-26 14:54:14.606 Detail, Node031, Removing current message
2019-07-26 14:54:14.607 Detail, Node031, Notification: Notification - TimeOut
....
2019-07-26 14:54:14.614 Info, Node031, Sending (Send) message (Callback ID=0xde, Expected Reply=0x04) - SwitchBinaryCmd_Get (Node=31): 0x01, 0x09, 0x00, 0x13, 0x1f, 0x02, 0x25, 0x02, 0x25, 0xde, 0x24
2019-07-26 14:54:14.743 Detail, Node031,   Received: 0x01, 0x04, 0x01, 0x13, 0x01, 0xe8
2019-07-26 14:54:14.744 Detail, Node031,   ZW_SEND_DATA delivered to Z-Wave stack
2019-07-26 14:54:19.979 Detail, Node031,   Received: 0x01, 0x07, 0x00, 0x13, 0xde, 0x01, 0x02, 0x18, 0x2e
2019-07-26 14:54:19.979 Detail, Node031,   ZW_SEND_DATA Request with callback ID 0xde received (expected 0xde)
2019-07-26 14:54:19.979 Info, Node031, WARNING: ZW_SEND_DATA failed. No ACK received - device may be asleep.
2019-07-26 14:54:19.979 Warning, Node031, WARNING: Device is not a sleeping node.
2019-07-26 14:54:19.979 Error, Node031, ERROR: node presumed dead
....
2019-07-26 14:54:19.979 Warning, CheckCompletedNodeQueries m_allNodesQueried=0 m_awakeNodesQueried=0
2019-07-26 14:54:19.979 Warning, CheckCompletedNodeQueries all=0, deadFound=1 sleepingOnly=0
2019-07-26 14:54:19.979 Detail, Node001,   Expected callbackId was received
2019-07-26 14:54:19.979 Detail, Node031, Notification: Notification - Node Dead
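The key difference between the two logs is the ZW_SEND_DATA callback frame (`0x01, 0x07, 0x00, 0x13, ...`). The byte after the callback ID is the transmit status. It is 0x00 in the working log and 0x01 in the failing one, and OpenZWave reports the latter as "No ACK received". Below is a minimal sketch that pulls that status out of a logged callback frame. It assumes the frame layout seen above and lists only the first three status codes; anything else is reported as unknown. `decode_send_data_callback` is a hypothetical helper name.

```python
# Transmit status codes returned in a ZW_SEND_DATA callback.
# Only the first three are listed; any other value is shown as unknown.
TX_STATUS = {
    0x00: "TRANSMIT_COMPLETE_OK",
    0x01: "TRANSMIT_COMPLETE_NO_ACK",
    0x02: "TRANSMIT_COMPLETE_FAIL",
}


def decode_send_data_callback(raw):
    """Return (callback_id, status_name) for a logged ZW_SEND_DATA callback."""
    data = bytes(raw)
    # Assumed layout: SOF, length, type 0x00 (request), function 0x13,
    # callback ID, transmit status, further bytes, checksum.
    if len(data) < 7 or data[0] != 0x01 or data[2] != 0x00 or data[3] != 0x13:
        raise ValueError("not a ZW_SEND_DATA callback frame")
    check = 0xFF
    for b in data[1:-1]:
        check ^= b
    if check != data[-1]:
        raise ValueError("checksum mismatch")
    callback_id, status = data[4], data[5]
    return callback_id, TX_STATUS.get(status, f"UNKNOWN(0x{status:02x})")


# Callback frames copied from the working and failing logs above.
good = [0x01, 0x07, 0x00, 0x13, 0xa4, 0x00, 0x00, 0x0d, 0x42]
bad = [0x01, 0x07, 0x00, 0x13, 0xa5, 0x01, 0x02, 0x34, 0x79]
for frame in (good, bad):
    cb, status = decode_send_data_callback(frame)
    print(f"callback 0x{cb:02x}: {status}")
```

This prints `callback 0xa4: TRANSMIT_COMPLETE_OK` for the working log and `callback 0xa5: TRANSMIT_COMPLETE_NO_ACK` for the failing one. In other words, the stick accepted the frame each time (the `0x01, 0x04, 0x01, 0x13, 0x01, 0xe8` response), but in the failing case no acknowledgement came back from node 31 over the air.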

A second restart increases the number of dead nodes from one to four or more. It does not matter whether the RPi 4B is rebooted between HA restarts; the behavior is repeatable.
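To see whether the dead-node count grows, and whether the same nodes fail after each restart, one option is to save a copy of the OpenZWave log after every restart and count the "Node Dead" notifications per node. Below is a minimal sketch. It assumes the line format shown in the excerpts above (timestamp, level, `NodeNNN`, message). `dead_node_counts` is a hypothetical helper, and the log filename depends on the installation.

```python
import re
from collections import Counter

# Matches lines such as
# "2019-07-26 14:54:19.979 Detail, Node031, Notification: Notification - Node Dead"
LINE_RE = re.compile(r"^\S+ \S+ \w+, Node(\d+), (.*)$")


def dead_node_counts(lines):
    """Count 'Node Dead' notifications per node ID in an OpenZWave log."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and "Notification - Node Dead" in m.group(2):
            counts[int(m.group(1))] += 1
    return counts


# Lines taken from the failing log above; only the last one is counted.
sample = [
    "2019-07-26 14:54:19.979 Error, Node031, ERROR: node presumed dead",
    "2019-07-26 14:54:19.979 Warning, CheckCompletedNodeQueries all=0, deadFound=1 sleepingOnly=0",
    "2019-07-26 14:54:19.979 Detail, Node031, Notification: Notification - Node Dead",
]
print(dead_node_counts(sample))
```

Calling `dead_node_counts(f)` on each saved log (with `f` an open file) and comparing the resulting counters side by side would show whether the failing nodes repeat across restarts.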

Resolution:
At this point, moving the Zstick to an RPi 3B+ with the same /config directory (as it was before the problem appeared) works correctly across restarts, with no dead nodes. Moving the Zstick BACK to the RPi 4B and rebooting it clears the dead nodes, and HA works normally until the first restart, when the dead nodes reappear. This is also repeatable, although I have not checked whether the same nodes fail on each restart.

If anyone comes across this topic and has similar issues, their reports might help identify the root cause of the problem.

UPDATE - Edited 07-27-2019:
After thinking about the differences between the RPi 3B+ and the RPi 4B, it occurred to me that the RPi 3B+ was in a slightly different location (a few feet away) from the RPi 4B. So I swapped the locations and voilà, the dead node issues with the RPi 4B went away. Notably, the RPi 3B+ continued to work as well. I then swapped the locations back and was able to reproduce the dead nodes on the RPi 4B in the original location. It is a mystery what differs between the two locations other than a few feet of distance.

firstof9 · July 26, 2019, 10:17pm

Also there’s this issue with the Pi 4 and the USB sticks:

https://www.raspberrypi.org/forums/viewtopic.php?t=245031

A320Peter · July 26, 2019, 11:39pm

A temporary solution is to use a basic USB 2.0 hub with a controller that is insensitive to voltage drop, like this one. It is not confirmed, but it is very likely that the Aeon Gen5 stick does not comply with some USB standards and is out of tolerance for the new USB controller chip in the Pi 4.

g.nigro · September 3, 2019, 8:42am

I'm seeing strange behavior…
Previously I used a Raspberry Pi 3 with Hass.io, a Z-Stick Gen5 and a Multisensor 6,
and everything was working OK.
Then I switched to a Raspberry Pi 4 with Hass.io on Docker installed on an SSD, and restored a backup from the Raspberry Pi 3 (without problems).
As a workaround I purchased a Sabrent USB 2.0 hub (not wall-powered), and the Z-Stick Gen5 is recognised as /dev/ttyACM0.
I can connect the Multisensor 6, and if I leave it in the same room as the stick and the Raspberry Pi 4 it works OK. But if I put it back where it was with the old setup, the node starts to fail (ProtocolInfo) and I can't bring it back.
The Multisensor 6 is USB powered. I've tested with a Sabrent USB 3.0 hub (not wall-powered) with the same results.
It seems like a range problem; any ideas? I can't put the Multisensor closer.

LeapFrog · September 3, 2019, 10:35am

I too suspect an interference/range issue caused by having the RPi 4B and the Zstick in close proximity. In my case (OP above), I ended up moving the Zstick about five feet away from the RPi 4B with a USB extension cable, and all of the intermittent ‘node dead’ issues disappeared. Additionally, the round-trip time (RTT) for all nodes dropped by a factor of two or more. Since there are 14 nodes within direct range of the controller, this appears to be more than a simple ‘dead spot’ situation.
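The RTT figures come from the per-node "Request RTT … Average Request RTT …" and "Response RTT … Average Response RTT …" lines that OpenZWave writes. The values appear to be milliseconds. In the first excerpt, "Request RTT 127" matches the 127 ms between sending the frame at 13:46:14.960 and receiving the callback at 13:46:15.087. Below is a minimal sketch for collecting the latest average per node from a log and comparing two logs, for example before and after moving the stick. It assumes that line format, and the function names are hypothetical.

```python
import re

# Matches e.g. "Node031, Request RTT 127 Average Request RTT 143";
# \2 requires the same word (Request/Response) in both places.
RTT_RE = re.compile(r"Node(\d+), (Request|Response) RTT (\d+) Average \2 RTT (\d+)")


def latest_average_rtt(lines):
    """Return {node: {"Request": ms, "Response": ms}} using the last average seen."""
    result = {}
    for line in lines:
        m = RTT_RE.search(line)
        if m:
            node, kind, _sample, average = m.groups()
            result.setdefault(int(node), {})[kind] = int(average)
    return result


def rtt_ratio(before, after, kind="Request"):
    """Per-node ratio before/after; 2.0 means the average RTT halved."""
    return {
        node: before[node][kind] / after[node][kind]
        for node in before.keys() & after.keys()
        if kind in before[node] and kind in after[node] and after[node][kind]
    }


lines = [
    "2019-07-26 13:46:15.087 Info, Node031, Request RTT 127 Average Request RTT 143",
    "2019-07-26 13:46:15.135 Info, Node031, Response RTT 175 Average Response RTT 204",
]
print(latest_average_rtt(lines))
```

Running `latest_average_rtt` over one log captured with the stick next to the Pi and another captured with the extension cable, then passing both results to `rtt_ratio`, would give a per-node improvement factor to check against the "factor of two or more" observation.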

I’ll speculate that there could be some type of RF interference at 908.42 MHz (the US Z-Wave frequency) associated with the RPi 4B that is not present with the RPi 3B+. You’d think that the FCC emissions test would check for this, but…

A320Peter · September 3, 2019, 7:04pm

I can confirm I saw very similar behavior on an RPi 3B+ too: dead nodes, unresponsive nodes, etc. I used a good-quality, well-shielded USB extension cable (1.5 meters) and that solved it. I can’t tell whether your problem is the same, but you might give it a try.

g.nigro · September 3, 2019, 7:30pm

OK, using a longer cable solved the problem; it now seems stable.

LeapFrog · September 3, 2019, 8:18pm

Very interesting - for those in the US, here’s the FCC certification info - the board is certified under Part 15 (for unlicensed operation in the WiFi bands).

I don’t see any RF emissions testing in the files other than for the two WiFi bands. The device is certified as a Class B device, which basically means its emissions are held to residential limits but it can still emit signals that interfere with other receivers. This is the caveat for Part 15 Class B devices:

This equipment has been tested and found to comply within the limits for a Class B digital device, pursuant to part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference in a residential installation. This equipment generates, uses, and can radiate radio frequency energy and, if not installed and used in accordance with the instructions, may cause harmful interference to radio communications. However, there is no guarantee that interference will not occur in a particular installation. If this equipment does cause harmful interference to radio or television reception, which can be determined by turning the equipment off and on, the user is encouraged to try to correct the interference by one or more of the following measures:
• Re-orient or relocate the receiving antenna
• Increase the separation between the equipment and receiver
• Connect the equipment into an outlet on a different circuit from that to which the receiver is connected
• Consult the dealer or an experienced radio/TV technician for help.

firstof9 · September 4, 2019, 12:20am

If you believe it’s a design flaw you may want to bring it up with the Raspberry Pi foundation.

papadi · March 30, 2021, 5:04am

The USB 2.0 hub hack was the solution for me as well, except that it doesn’t seem to be a very temporary one. It's more of a permanent one.