Diagnosing why nodes go dead

Over the last few weeks I have noticed that random nodes go dead at random times. Manually pinging a dead node brings it back immediately. I’ve done a few heals and upgraded the firmware on the USB stick, and HA is of course on the latest version.

What troubleshooting steps can I take to try and solve this issue? Is there a good debug guide? Things to look out for?

That’s a good question. I think there are 3 classes of problems.

  • Stick issues - outdated firmware, RF interference, low USB voltage, faulty USB cables, etc., that cause it to behave badly.
  • Poor mesh - not enough nodes to have good consistent coverage, badly behaving nodes
  • Too much data - sensors updating too frequently, too many commands being sent too quickly.

A good place to start is to go through zwaveui and look at the statistics for the controller and every node. Is your overall error rate higher than 0.1%, or are there specific nodes with higher rates?
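As a rough sketch of that check, here is how the per-node counters could be turned into an error rate and compared against the 0.1% figure. The counter names (commandsTX, commandsDroppedTX, timeoutResponse) are assumed to match what the statistics panel shows, and the definition of "error rate" used here (dropped plus timed-out commands over attempted commands) is my own assumption, so adjust both to whatever your install reports.

```python
from typing import Dict, List, Tuple

THRESHOLD = 0.001  # 0.1%, the rule-of-thumb figure from the post


def error_rate(stats: Dict[str, int]) -> float:
    """Fraction of outgoing commands that were dropped or timed out.

    Counter names are assumptions; missing counters are treated as 0.
    """
    sent = stats.get("commandsTX", 0)
    dropped = stats.get("commandsDroppedTX", 0)
    timeouts = stats.get("timeoutResponse", 0)
    attempted = sent + dropped
    if attempted == 0:
        return 0.0  # no traffic yet, nothing to judge
    return (dropped + timeouts) / attempted


def noisy_nodes(
    all_stats: Dict[int, Dict[str, int]], threshold: float = THRESHOLD
) -> List[Tuple[int, float]]:
    """Return (node_id, rate) pairs above the threshold, worst first."""
    flagged = [(node_id, error_rate(s)) for node_id, s in all_stats.items()]
    flagged = [(node_id, rate) for node_id, rate in flagged if rate > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    example = {
        2: {"commandsTX": 1000, "commandsDroppedTX": 0, "timeoutResponse": 0},
        7: {"commandsTX": 990, "commandsDroppedTX": 10, "timeoutResponse": 5},
    }
    for node_id, rate in noisy_nodes(example):
        print(f"node {node_id}: {rate:.2%} errors")
```

The numbers would have to be copied or exported from the statistics view by hand; this only shows the arithmetic.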

The second piece is to implement some monitoring. If a node is supposed to send a temperature update every 10 minutes, is that happening? For nodes that don’t update periodically (like switches), poll them at a regular interval to force an update. This allows you to detect problems ahead of time.
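A minimal sketch of that kind of staleness check, assuming you already have a "last update" timestamp per node from somewhere (for example the last_updated value of an entity). The node names, the grace factor of 2, and the function itself are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_nodes(
    last_update: Dict[str, datetime],
    expected_interval: Dict[str, timedelta],
    now: datetime,
    grace: float = 2.0,
) -> List[str]:
    """Return names of nodes that have been silent for too long.

    A node counts as stale if it has never reported, or if its last
    update is older than `grace` times its expected interval.
    """
    stale = []
    for name, interval in expected_interval.items():
        seen = last_update.get(name)
        if seen is None or now - seen > interval * grace:
            stale.append(name)
    return sorted(stale)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    last = {
        "hall_temp": now - timedelta(minutes=25),   # should report every 10 min
        "porch_switch": now - timedelta(minutes=5),  # polled every 30 min
    }
    expected = {
        "hall_temp": timedelta(minutes=10),
        "porch_switch": timedelta(minutes=30),
    }
    print(stale_nodes(last, expected, now))
```

The grace factor is there so a single missed report does not raise an alarm; a node has to miss roughly two in a row.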

Recently I’ve been working on reliable service calls that measure how long it takes the command to complete. So if a node usually responds to a switch_on in 80ms, and it starts taking 500ms - something has changed.
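To illustrate the idea (this is not the code from the repo mentioned here, just a sketch under my own assumptions): time each call, keep a rolling window of recent latencies, and flag a call when it takes several times longer than the recent median. The window size, the factor of 3, and the class and function names are all made up for the example.

```python
import statistics
import time
from collections import deque
from typing import Any, Callable, Deque, Tuple


class LatencyTracker:
    """Rolling baseline of response times for one node."""

    def __init__(self, window: int = 20, factor: float = 3.0, min_samples: int = 5):
        self.samples: Deque[float] = deque(maxlen=window)
        self.factor = factor
        self.min_samples = min_samples

    def record(self, ms: float) -> bool:
        """Store a latency in ms; return True if it looks abnormally slow.

        The sample is compared against the median of the samples seen
        before it, and only once enough history has been collected.
        """
        slow = False
        if len(self.samples) >= self.min_samples:
            baseline = statistics.median(self.samples)
            slow = ms > baseline * self.factor
        self.samples.append(ms)
        return slow


def timed_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run func and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    tracker = LatencyTracker()
    for ms in [80, 82, 79, 81, 80, 500]:
        if tracker.record(ms):
            print(f"{ms} ms is well above the recent baseline")
```

Using the median rather than the mean keeps one slow outlier from dragging the baseline up, so the next normal response is not compared against an inflated number.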

I can post repo links for both if needed.

Great info, thanks. I think I can rule out the poor mesh. My USB stick is in a central location and there are 3 or more relay devices per room.

As for stick issues, I did update to the latest firmware about a week ago. I’m leaning more towards interference or failing hardware, since it was working flawlessly for over a year. This morning I did a refresh/re-plug of all hardware. Full disclosure: in order to get my USB stick in the middle of the house I am running USB over powered ethernet, so that could be what is failing.

I will monitor zwaveui to see if I can find any offenders. It would be a nice feature if that information bubbled up. It sounds like that is what the service calls you are working on do. I’d be happy to take a look at any of those repo links.