Z-wave Dropping Command, Slow startup, Long Hickups (Troubleshooting)

This is my ‘story’ and how i managed to find the rotten eggs in my network of z-wave. I have a network which consists about roughly 60 devices. However lately it was taking more then 20 minutes and sometimes even longer before it was ready in Hass. Sometimes it worked like a champ, then it took longer then normal and sometimes it would not respond at all.

I started healing the network, took ages didn’t help. Checked for dead nodes didn’t help. Tried a windows danfoss tool and installed the z-wave stick on my windows and it would startup and work instantly (since it didn’t do a neighborhood scan) so i knew the ‘network’ was “ok”

My next action was that i started up ozwcp (even though hass should be able to manage it all, i find it handy to use in these types of issues) anyways, it didn’t show much besides a lot of duplicate messages. Which was very odd also for a device that was working correctly. Anyhow, i decided to remove my zwcfg_xxxxx.xml file to start fresh from the data from the stick. After some time it was ready and this was my end result when i first ran it in ozcwp:

Mind you, ozwcp gives these names like “Fibaro” default but ONLY on a fresh new setup or if you add a new one. But in my case it seemed it had stopped at a certain device. Node055 to be exact. So keeping the OZW logs open i did a refresh of that device:

2018-04-11 20:32:23.303 Info, Node056, NoOperation::Set - Routing=true
2018-04-11 20:32:23.303 Detail, Node056, Queuing (NoOp) NoOperation_Set (Node=56): 0x01, 0x09, 0x00, 0x13, 0x38, 0x02, 0x00, 0x00, 0x25, 0x6f, 0x95
2018-04-11 20:32:23.304 Detail, Node056, Queuing (Query) Query Stage Complete (Probe)
2018-04-11 20:32:23.304 Detail,
2018-04-11 20:32:23.304 Info, Node056, Sending (NoOp) message (Callback ID=0x6f, Expected Reply=0x13) - NoOperation_Set (Node=56): 0x01, 0x09, 0x00, 0x13, 0x38, 0x02, 0x00, 0x00, 0x25, 0x6f, 0x95
2018-04-11 20:32:23.314 Detail, Node056,   Received: 0x01, 0x04, 0x01, 0x13, 0x01, 0xe8
2018-04-11 20:32:23.314 Detail, Node056,   ZW_SEND_DATA delivered to Z-Wave stack
2018-04-11 20:32:33.305 Error, Node056, ERROR: Dropping command, expected response not received after 1 attempt(s)
2018-04-11 20:32:33.305 Detail, Node056, Removing current message
2018-04-11 20:32:33.305 Detail, Node056, Notification: Notification - TimeOut
2018-04-11 20:32:33.308 Detail, Node056, Query Stage Complete (Probe)
2018-04-11 20:32:33.308 Detail, Node056, AdvanceQueries queryPending=0 queryRetries=0 queryStage=WakeUp live=1
2018-04-11 20:32:33.308 Detail, Node056, QueryStage_WakeUp
2018-04-11 20:32:33.308 Detail, Node056, QueryStage_ManufacturerSpecific1
2018-04-11 20:32:33.308 Detail, Node056, Checking for ManufacturerSpecific CC and Requesting values if present on this node
2018-04-11 20:32:33.308 Detail, Node056, QueryStage_NodePlusInfo
2018-04-11 20:32:33.308 Detail, Node056, QueryStage_SecurityReport
2018-04-11 20:32:33.308 Detail, Node056, QueryStage_ManufacturerSpecific2
2018-04-11 20:32:33.308 Detail, Node056, QueryStage_Versions
2018-04-11 20:32:33.308 Detail, Node056, QueryStage_Instances
2018-04-11 20:32:33.308 Info, Node056, Essential node queries are complete
2018-04-11 20:32:33.308 Detail, Node056, QueryStage_Static
2018-04-11 20:32:33.308 Detail, Node056, QueryStage_Associations
2018-04-11 20:32:33.308 Detail, Node056, QueryStage_Neighbors
2018-04-11 20:32:33.309 Detail, Requesting routing info (neighbor list) for Node 56
2018-04-11 20:32:33.309 Detail, Node056, Queuing (Command) Get Routing Info (Node=56): 0x01, 0x07, 0x00, 0x80, 0x38, 0x00, 0x00, 0x03, 0x43
2018-04-11 20:32:33.309 Detail, Node056, Queuing (Query) Query Stage Complete (Neighbors)
2018-04-11 20:32:33.309 Detail, Node056, Notification: EssentialNodeQueriesComplete
2018-04-11 20:32:33.311 Detail,
2018-04-11 20:32:33.311 Info, Node056, Sending (Command) message (Callback ID=0x00, Expected Reply=0x80) - Get Routing Info (Node=56): 0x01, 0x07, 0x00, 0x80, 0x38, 0x00, 0x00, 0x03, 0x43
2018-04-11 20:32:34.311 Error, Node056, ERROR: Dropping command, expected response not received after 1 attempt(s)
2018-04-11 20:32:34.311 Detail, Node056, Removing current message
2018-04-11 20:32:34.311 Detail, Node056, Notification: Notification - TimeOut
2018-04-11 20:32:34.314 Detail, Node056, Query Stage Complete (Neighbors)
2018-04-11 20:32:34.314 Detail, Node056, AdvanceQueries queryPending=0 queryRetries=0 queryStage=Session live=1
2018-04-11 20:32:34.315 Detail, Node056, QueryStage_Session
2018-04-11 20:32:34.315 Detail, Node056, QueryStage_Dynamic
2018-04-11 20:32:34.315 Detail, Node056, QueryStage_Configuration
2018-04-11 20:32:34.315 Detail, Node056, QueryStage_Complete
2018-04-11 20:32:34.315 Warning, CheckCompletedNodeQueries m_allNodesQueried=1 m_awakeNodesQueried=1
2018-04-11 20:32:34.315 Detail, Node056, Notification: NodeQueriesComplete

After that i try to switch node022, which is a normal functioning binary switch for my light which normally works.

2018-04-11 20:32:47.358 Info, Node022, Value::Set - COMMAND_CLASS_SWITCH_BINARY - Switch - 0 - 1 - False
2018-04-11 20:32:47.359 Info, Node022, SwitchBinary::Set - Setting node 22 to Off
2018-04-11 20:32:47.359 Detail, Node022, Queuing (Send) SwitchBinaryCmd_Set (Node=22): 0x01, 0x0a, 0x00, 0x13, 0x16, 0x03, 0x25, 0x01, 0x00, 0x25, 0x70, 0x82
2018-04-11 20:32:47.359 Detail, Node022, Queuing (Send) SwitchBinaryCmd_Get (Node=22): 0x01, 0x09, 0x00, 0x13, 0x16, 0x02, 0x25, 0x02, 0x25, 0x71, 0x82
2018-04-11 20:32:47.359 Detail,
2018-04-11 20:32:47.359 Info, Node022, Sending (Send) message (Callback ID=0x70, Expected Reply=0x13) - SwitchBinaryCmd_Set (Node=22): 0x01, 0x0a, 0x00, 0x13, 0x16, 0x03, 0x25, 0x01, 0x00, 0x25, 0x70, 0x82
2018-04-11 20:32:48.359 Error, Node022, ERROR: Dropping command, expected response not received after 1 attempt(s)
2018-04-11 20:32:48.359 Detail, Node022, Removing current message
2018-04-11 20:32:48.359 Detail, Node022, Notification: Notification - TimeOut
2018-04-11 20:32:48.363 Detail,
2018-04-11 20:32:48.363 Info, Node022, Sending (Send) message (Callback ID=0x71, Expected Reply=0x04) - SwitchBinaryCmd_Get (Node=22): 0x01, 0x09, 0x00, 0x13, 0x16, 0x02, 0x25, 0x02, 0x25, 0x71, 0x82
2018-04-11 20:32:49.363 Error, Node022, ERROR: Dropping command, expected response not received after 1 attempt(s)
2018-04-11 20:32:49.363 Detail, Node022, Removing current message
2018-04-11 20:32:49.363 Detail, Node022, Notification: Notification - TimeOut

Same issue with Node22 now, also dropping command. Nothing responds anymore.

After some time it comes back with:

2018-04-11 20:34:12.204 Detail,   Received: 0x01, 0x20, 0x01, 0x80, 0x01, 0x00, 0x00, 0x08, 0x00, 0x00, 0x50, 0x08, 0x00, 0x1c, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03
2018-04-11 20:34:12.204 Detail,
2018-04-11 20:34:12.204 Info, Received reply to FUNC_ID_ZW_GET_ROUTING_INFO
2018-04-11 20:34:12.207 Detail, CAN received...triggering resend
2018-04-11 20:34:12.207 Warning, m_currentMsg was NULL when trying to set MaxSendAttempts
2018-04-11 20:34:12.207 Always,
2018-04-11 20:34:12.207 Always, Dumping queued log messages
2018-04-11 20:34:12.207 Always,
2018-04-11 20:34:12.207 Always,
2018-04-11 20:34:12.207 Always, End of queued log message dump
2018-04-11 20:34:12.207 Always,
2018-04-11 20:34:12.207 Detail, WriteMsg CAN m_currentMsg=00000000
2018-04-11 20:34:12.208 Detail, CAN received...triggering resend
2018-04-11 20:34:12.208 Warning, m_currentMsg was NULL when trying to set MaxSendAttempts
2018-04-11 20:34:12.208 Always,
2018-04-11 20:34:12.208 Always, Dumping queued log messages
2018-04-11 20:34:12.208 Always,
2018-04-11 20:34:12.208 Always,
2018-04-11 20:34:12.208 Always, End of queued log message dump
2018-04-11 20:34:12.208 Always,
2018-04-11 20:34:12.208 Detail, WriteMsg CAN m_currentMsg=00000000

Which basicly ammounts for the number of times i tried to send a command. After that it will still fail on me with dropped commands.

So from 20:32 the network was broken, up untill 20:39, so a dead time of 7 minutes!

After that i can see this:

2018-04-11 20:39:39.179 Detail,   Received: 0x01, 0x07, 0x00, 0x13, 0x72, 0x00, 0x00, 0x0f, 0x96
2018-04-11 20:39:39.179 Detail,   ZW_SEND_DATA Request with callback ID 0x72 received (expected 0x00)
2018-04-11 20:39:39.179 Warning, WARNING: Unexpected Callback ID received
2018-04-11 20:39:43.149 Detail,   Received: 0x01, 0x07, 0x00, 0x13, 0x72, 0x01, 0x01, 0x9c, 0x05
2018-04-11 20:39:43.149 Detail,   ZW_SEND_DATA Request with callback ID 0x72 received (expected 0x00)
2018-04-11 20:39:43.150 Warning, WARNING: Unexpected Callback ID received

Upon which it still changes my light switch to the condition i wanted. This is due to the nature of the stack, you can see that in the message “CAN received…triggering resend” so it puts it back on the que. So once it becomes “stable” again it delivers it. Due note that the status in Home Assistant or any tool might be wrong at that point (but switching it twice will resolve that ofcourse)

As an example, Node055 and Node056 are on the same power outlet in my house, so must be something wrong there. But Node057 is a normal powered binary light again. This is the normal output of a Refresh Node:

2018-04-11 20:42:27.784 Detail, Node057, AdvanceQueries queryPending=0 queryRetries=0 queryStage=ProtocolInfo live=1
2018-04-11 20:42:27.784 Detail, Node057, QueryStage_Probe
2018-04-11 20:42:27.784 Info, Node057, NoOperation::Set - Routing=true
2018-04-11 20:42:27.784 Detail, Node057, Queuing (NoOp) NoOperation_Set (Node=57): 0x01, 0x09, 0x00, 0x13, 0x39, 0x02, 0x00, 0x00, 0x25, 0x82, 0x79
2018-04-11 20:42:27.785 Detail, Node057, Queuing (Query) Query Stage Complete (Probe)
2018-04-11 20:42:27.785 Detail,
2018-04-11 20:42:27.785 Info, Node057, Sending (NoOp) message (Callback ID=0x82, Expected Reply=0x13) - NoOperation_Set (Node=57): 0x01, 0x09, 0x00, 0x13, 0x39, 0x02, 0x00, 0x00, 0x25, 0x82, 0x79
2018-04-11 20:42:27.792 Detail, Node057,   Received: 0x01, 0x04, 0x01, 0x13, 0x01, 0xe8
2018-04-11 20:42:27.792 Detail, Node057,   ZW_SEND_DATA delivered to Z-Wave stack
2018-04-11 20:42:27.810 Detail, Node057,   Received: 0x01, 0x07, 0x00, 0x13, 0x82, 0x00, 0x00, 0x02, 0x6b
2018-04-11 20:42:27.810 Detail, Node057,   ZW_SEND_DATA Request with callback ID 0x82 received (expected 0x82)
2018-04-11 20:42:27.810 Info, Node057, Request RTT 24 Average Request RTT 24
2018-04-11 20:42:27.810 Detail,   Expected callbackId was received
2018-04-11 20:42:27.810 Detail,   Expected reply was received
2018-04-11 20:42:27.810 Detail,   Message transaction complete
2018-04-11 20:42:27.810 Detail,
2018-04-11 20:42:27.810 Detail, Node057, Removing current message
2018-04-11 20:42:27.810 Detail, Node057, Notification: Notification - NoOperation
2018-04-11 20:42:27.813 Detail, Node057, Query Stage Complete (Probe)
2018-04-11 20:42:27.813 Detail, Node057, AdvanceQueries queryPending=0 queryRetries=0 queryStage=WakeUp live=1
2018-04-11 20:42:27.813 Detail, Node057, QueryStage_WakeUp
2018-04-11 20:42:27.813 Detail, Node057, QueryStage_ManufacturerSpecific1
2018-04-11 20:42:27.813 Detail, Node057, Checking for ManufacturerSpecific CC and Requesting values if present on this node
2018-04-11 20:42:27.813 Detail, Node057, QueryStage_NodePlusInfo
2018-04-11 20:42:27.813 Detail, Node057, QueryStage_SecurityReport
2018-04-11 20:42:27.813 Detail, Node057, QueryStage_ManufacturerSpecific2
2018-04-11 20:42:27.813 Detail, Node057, QueryStage_Versions
2018-04-11 20:42:27.813 Detail, Node057, QueryStage_Instances
2018-04-11 20:42:27.813 Info, Node057, Essential node queries are complete
2018-04-11 20:42:27.813 Detail, Node057, QueryStage_Static
2018-04-11 20:42:27.813 Detail, Node057, QueryStage_Associations
2018-04-11 20:42:27.813 Detail, Node057, QueryStage_Neighbors
2018-04-11 20:42:27.813 Detail, Requesting routing info (neighbor list) for Node 57
2018-04-11 20:42:27.813 Detail, Node057, Queuing (Command) Get Routing Info (Node=57): 0x01, 0x07, 0x00, 0x80, 0x39, 0x00, 0x00, 0x03, 0x42
2018-04-11 20:42:27.813 Detail, Node057, Queuing (Query) Query Stage Complete (Neighbors)
2018-04-11 20:42:27.813 Detail, Node057, Notification: EssentialNodeQueriesComplete
2018-04-11 20:42:27.814 Detail,
2018-04-11 20:42:27.814 Info, Node057, Sending (Command) message (Callback ID=0x00, Expected Reply=0x80) - Get Routing Info (Node=57): 0x01, 0x07, 0x00, 0x80, 0x39, 0x00, 0x00, 0x03, 0x42
2018-04-11 20:42:27.818 Detail, Node057,   Received: 0x01, 0x20, 0x01, 0x80, 0x01, 0x00, 0x60, 0x79, 0x86, 0x9f, 0x1e, 0x4c, 0x98, 0x0b, 0x60, 0xaf, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x51
2018-04-11 20:42:27.819 Detail,
2018-04-11 20:42:27.819 Info, Node057, Received reply to FUNC_ID_ZW_GET_ROUTING_INFO
2018-04-11 20:42:27.819 Info, Node057,     Neighbors of this node are:
2018-04-11 20:42:27.819 Info, Node057,     Node 1
2018-04-11 20:42:27.819 Info, Node057,     Node 22
2018-04-11 20:42:27.819 Info, Node057,     Node 23
2018-04-11 20:42:27.819 Info, Node057,     Node 25
2018-04-11 20:42:27.819 Info, Node057,     Node 28
2018-04-11 20:42:27.819 Info, Node057,     Node 29
2018-04-11 20:42:27.819 Info, Node057,     Node 30
2018-04-11 20:42:27.819 Info, Node057,     Node 31
2018-04-11 20:42:27.819 Info, Node057,     Node 34
2018-04-11 20:42:27.819 Info, Node057,     Node 35
2018-04-11 20:42:27.819 Info, Node057,     Node 40
2018-04-11 20:42:27.819 Info, Node057,     Node 41
2018-04-11 20:42:27.819 Info, Node057,     Node 42
2018-04-11 20:42:27.819 Info, Node057,     Node 43
2018-04-11 20:42:27.819 Info, Node057,     Node 44
2018-04-11 20:42:27.819 Info, Node057,     Node 45
2018-04-11 20:42:27.819 Info, Node057,     Node 48
2018-04-11 20:42:27.819 Info, Node057,     Node 50
2018-04-11 20:42:27.819 Info, Node057,     Node 51
2018-04-11 20:42:27.819 Info, Node057,     Node 52
2018-04-11 20:42:27.819 Info, Node057,     Node 53
2018-04-11 20:42:27.819 Info, Node057,     Node 59
2018-04-11 20:42:27.819 Info, Node057,     Node 60
2018-04-11 20:42:27.819 Info, Node057,     Node 63
2018-04-11 20:42:27.819 Info, Node057,     Node 68
2018-04-11 20:42:27.819 Info, Node057,     Node 69
2018-04-11 20:42:27.819 Info, Node057,     Node 72
2018-04-11 20:42:27.819 Info, Node057,     Node 73
2018-04-11 20:42:27.819 Info, Node057,     Node 74
2018-04-11 20:42:27.819 Info, Node057,     Node 76
2018-04-11 20:42:27.819 Info, Node057,     Node 86
2018-04-11 20:42:27.819 Info, Node057,     Node 87
2018-04-11 20:42:27.819 Info, Node057,     Node 89
2018-04-11 20:42:27.819 Info, Node057,     Node 90
2018-04-11 20:42:27.819 Info, Node057,     Node 91
2018-04-11 20:42:27.819 Info, Node057,     Node 92
2018-04-11 20:42:27.819 Info, Node057,     Node 94
2018-04-11 20:42:27.819 Info, Node057,     Node 96
2018-04-11 20:42:27.819 Detail,   Expected reply was received
2018-04-11 20:42:27.819 Detail,   Message transaction complete
2018-04-11 20:42:27.819 Detail,
2018-04-11 20:42:27.819 Detail, Node057, Removing current message
2018-04-11 20:42:27.819 Detail, Node057, Query Stage Complete (Neighbors)
2018-04-11 20:42:27.819 Detail, Node057, AdvanceQueries queryPending=0 queryRetries=0 queryStage=Session live=1
2018-04-11 20:42:27.819 Detail, Node057, QueryStage_Session
2018-04-11 20:42:27.819 Detail, Node057, QueryStage_Dynamic
2018-04-11 20:42:27.819 Detail, Node057, QueryStage_Configuration
2018-04-11 20:42:27.819 Detail, Node057, QueryStage_Complete
2018-04-11 20:42:27.820 Warning, CheckCompletedNodeQueries m_allNodesQueried=1 m_awakeNodesQueried=1
2018-04-11 20:42:27.820 Detail, Node057, Notification: NodeQueriesComplete

Startup took me 20+ minutes as always with these issues.

Now when it started up correctly (i removed the 2 dead nodes i had) after a restart it takes me now ± 10 minutes. Still have some nodes that are funky which needs some love, which might speed this up a bit more too.

So in the end, remove your xml file (or make a backup) start up fresh with ozwcp let it run to the end. Then check like you see in the image where it stopped, try to refresh that node where it didn’t have a name for it and see if its dropping commands. If that is the case you have a dead node but isn’t marked as dead. Then do the same / repeat untill you found them all. I had 2 (both where behind the same power supply that was broken)

I hope this helps some people debugging more since i had a hard time nailing this one. Feel free to ask if you need more information (or discord, had a lot tips/feedback from Tinkerer on discord which helped me too)

p.s. during a startup you should still be able to turn lights on/off although slow (this is normal since it is still scanning your network) if this doesn’t work already, then something is still wrong too.

Regards,
Erik (Riemers on discord Hass)

3 Likes

Another one was having a device in zwave control panel which shows just fine in the list but will not allow to show the entities. Even though those entities are existing in HA. As example my RGBW didn’t respond anymore. Do note, it was still showing “complete” and Test/Heal was still working.
See below example:

The only solution to fix this was to stop the zwave stack in HA. (Stop network) and then go to Entity Registry and remove all entities that where part of the device. Including the Zwave one.

Then i did a full restart and it came back in the list, and this time with 5 new entities alive and kicking!

I’ve added the device, i did an include but its not showing in HA in the zwave panel. When you have that and you look in your zwcfg xml file and see the last one added in the lines of this:

                <Manufacturer id="0" name="">
                        <Product type="0" id="0" name="" />
                </Manufacturer>

It usually means the device didn’t communicate everything. Waking up the device (if its battery powered) could help but if that doesn’t work you have to lookup the manufacturers ID.

(just remember, you could also just remove and re-add the device and try again)

In my case it was a Shenzhen Neo Electronics Co Ltd.
Looking at the openzwave github repo where they keep all supported devices you can look into the https://github.com/OpenZWave/open-zwave/blob/903c0c5bb65d1d72435e37de39297a7ec670b158/config/manufacturer_specific.xml file and search for it. A quick search trough this would show a

<Manufacturer id="0258" name="Shenzhen Neo Electronics Co Ltd">

And you can also find the id of the product

<Product config="shenzen_neo/nas-ds01z.xml" id="1082" name="Door/Window Detector" type="0003"/>

This is not something easy to find the product id, so you might have to play with that. Do note! Stopping the zwave stack in HA and starting it again usually doesn’t do the trick since the things are cached. A full restart is needed (like with most stuff that concerns Zwave)

Once restarted i was able to see and select my door sensor again.

I was in luck, another issue but also resolved. If you have a device that is spitting out a million messages per second which screws up your network and it shows information like this:

2019-06-30 21:08:54.593 Info, Node048, Response RTT 35 Average Response RTT 35
2019-06-30 21:08:54.594 Info, Node048, Max associations for node 48, group 3 is zero.  Querying associations for this node is complete.
2019-06-30 21:08:54.594 Info, Node048, Adding the controller to group 3 (Lifeline) of node 48
2019-06-30 21:08:54.594 Info, Node048, MultiChannelAssociation::Set - Adding instance 0 on node 1 to group 3 of node 48
2019-06-30 21:08:54.594 Detail, Node048, Queuing (Send) MultiChannelAssociationCmd_Set (Node=48): 0x01, 0x0b, 0x00, 0x13, 0x30, 0x04, 0x8e, 0x01, 0x03, 0x01, 0x25, 0x7a, 0x01
2019-06-30 21:08:54.594 Info, Node048, Get MultiChannelAssociation for group 3 of node 48
2019-06-30 21:08:54.594 Detail, Node048, Queuing (Send) MultiChannelAssociationCmd_Get (Node=48): 0x01, 0x0a, 0x00, 0x13, 0x30, 0x03, 0x8e, 0x02, 0x03, 0x25, 0x7b, 0x04
2019-06-30 21:08:54.594 Detail, Node048,   Expected reply and command class was received
2019-06-30 21:08:54.594 Detail, Node048,   Message transaction complete
2019-06-30 21:08:54.594 Detail,
2019-06-30 21:08:54.594 Detail, Node048, Removing current message

Constantly, then you have some bad luck too.

I tried to exclude it via HA, no luck. Tried to power down the device and start it again, no luck. The only thing i could do is use my Aeotec stick, go to the device and do a manual exclusion of the device. After i checked all was fine again i re-added the device and now it is back to normal. No clue as to why this matrix logging happened but it was stuck. Also tried a fresh config by removing my zwcfg but that also didn’t do anything.

So case solved on this one too, if i encounter more i might be able to write a book :wink: