Best way to troubleshoot ZHA losing all devices?

My ZHA integration from time to time starts showing all of my devices as unavailable. The time involved is unpredictable and can be anywhere from a couple hours to several weeks. When this happens, all of my devices are lost at the same time. The only cure I have found so far is to completely power off the server and power it back on (simple reboot is not sufficient). After that, all of the devices are recognized again.

As a software guy, I know the best advice is to disable everything else and reproduce the problem. That’s not really feasible in this case without losing the use of my HA server. What’s the best way for me to troubleshoot this?

I’m running current HAOS on generic x86-64 hardware. I’m using ZHA and a SkyConnect dongle. I have the Silicon Labs Multiprotocol addon installed (I once tried to uninstall this to help troubleshoot, but it created an error that I can’t recall offhand).

I turned on ZHA debug logging recently, and I was “lucky” enough to have the problem happen after less than a day. Alas, the log file is over 500 MB. Even after weeding out a bunch of stuff that I know is not relevant, the log is still over 100 MB.

So, what I am looking for is hints about what to look for in that log file. Suggestions?

Have you read this? Zigbee networks: how to guide for avoiding interference + optimize using Zigbee Router devices (repeaters/extenders) to get a stable mesh network with best possible range and coverage

Thanks for the suggestion. Yes, I have read that.

That’s mostly about why some subset of devices aren’t reliably on the network. In my case, it’s 100% of the devices being lost at the same time, and they all come back after a server power cycle. So, I’m skeptical that any of the advice in that article applies here.

How is your network built? How many routers and end devices do you have, are you using an extension cable, etc.?

The SkyConnect dangles in mid-air from an 18 inch USB extension cable.

Besides the SkyConnect, there are 20 other Zigbee devices. Two of those devices are smart outlets that act as routers, and both of those show up as directly connected to the SkyConnect. (In fact, that’s why I have those smart outlets. I don’t use them for the Zigbee control of the outlet.)

At the moment, a few of those 20 devices are legitimately offline (due to dead batteries and a lazy operator), but I had this problem even when they were all operational.

So you only have two routers and the coordinator, the rest are battery powered? If that is correct I suspect you have too few routers and that’s causing your issues. You probably need more routers to have a healthy network with that many battery devices. I have 29 battery powered ones and 42 routers for comparison.

This is explained in detail under the section " Network optimization (optimizing for mesh networking)" that I linked earlier.

There’s a lot I don’t know about Zigbee, so don’t take it the wrong way when I ask what your thinking is here. 18 battery devices sprinkled among 2 routers and the coordinator doesn’t seem like a very big number or network to me. I wonder what might be overwhelmed by that. Do you think it’s an RF thing or a protocol thing?

Do some searches on the forum; this is a classic Zigbee problem. I'm 99% sure that your problem will be solved by adding more routers.

And as I said this is explained in detail in the link :blush:

Power plugs or bulbs are an option, you also get dedicated routers like the IKEA ones.

Thanks for taking the time to respond. Back to my original question, do you know of anything I can look for in the debug log that might shed light?

@fleskefjes is right. Your two routers and the coordinator will be handling messages to all 18 battery-powered devices. They may be capable of that in theory, but there will be no alternative paths for messages to take - in the case of interference, for example.

Normally you would start by building a robust network of routers and add the battery devices afterwards.

At the moment you might expect to see errors relating to message delivery failures after multiple attempts, or to timeouts.
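With a log that large, one way to look for those errors is to stream the file and count suspicious lines per minute, so you can see when the failures start piling up. This is only a rough sketch: the keyword list is my guess at what delivery failures and timeouts might look like (not actual ZHA log messages), and the timestamp pattern assumes each line starts with `YYYY-MM-DD HH:MM:SS`. The function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed keywords, not taken from real ZHA output; adjust to your log.
KEYWORDS = re.compile(r"timeout|timed out|failed|delivery|error", re.IGNORECASE)
# Assumed line prefix "YYYY-MM-DD HH:MM:SS..."; captures up to the minute.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")


def count_matches_per_minute(lines: Iterable[str]) -> Counter:
    """Count keyword-matching lines, bucketed by the minute they were logged."""
    counts: Counter = Counter()
    for line in lines:
        stamp = TIMESTAMP.match(line)
        if stamp and KEYWORDS.search(line):
            counts[stamp.group(1)] += 1
    return counts


def busiest_minutes(path: str, top: int = 10) -> List[Tuple[str, int]]:
    """Stream a log file line by line and return the minutes with most matches."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_matches_per_minute(handle).most_common(top)
```

Calling `busiest_minutes("home-assistant.log")` (the path is just an example) never loads the whole file into memory, and the first minute where the count jumps is a reasonable place to start reading the raw log.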

If you want to see how much interference there could be on the channel you’re using, you can download diagnostics from the SkyConnect. Towards the end of the report there will be a section like this:

    "energy_scan": {
      "11": 52.75969252664325,
      "12": 88.70042934643088,
      "13": 84.164247274957,
      "14": 15.32285793082191,
      "15": 52.75969252664325,
      "16": 31.01324838787301,
      "17": 80.38447947821754,
      "18": 82.35373987514762,
      "19": 15.32285793082191,
      "20": 80.38447947821754,
      "21": 1.9464625152460222,
      "22": 2.84844209578687,
      "23": 46.26944564832987,
      "24": 1.5075412082833717,
      "25": 2.2107128772756957,
      "26": 52.75969252664325
    }

The percentages represent everything on the channel - Zigbee, your wi-fi, your neighbour’s wi-fi etc, etc.
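If you'd rather not eyeball the numbers, a few lines of Python will sort the channels from quietest to busiest. This is a minimal sketch that takes the `energy_scan` mapping as a plain dict (how you pull it out of the diagnostics file is up to you); `rank_channels` is a made-up helper name, and the values are the ones from the scan above.

```python
from typing import Dict, List, Tuple


def rank_channels(energy_scan: Dict[str, float]) -> List[Tuple[int, float]]:
    """Return (channel, energy) pairs sorted from quietest to busiest."""
    pairs = ((int(channel), energy) for channel, energy in energy_scan.items())
    return sorted(pairs, key=lambda pair: pair[1])


# Values copied from the energy scan shown above.
example_scan = {
    "11": 52.75969252664325, "12": 88.70042934643088,
    "13": 84.164247274957, "14": 15.32285793082191,
    "15": 52.75969252664325, "16": 31.01324838787301,
    "17": 80.38447947821754, "18": 82.35373987514762,
    "19": 15.32285793082191, "20": 80.38447947821754,
    "21": 1.9464625152460222, "22": 2.84844209578687,
    "23": 46.26944564832987, "24": 1.5075412082833717,
    "25": 2.2107128772756957, "26": 52.75969252664325,
}

ranked = rank_channels(example_scan)
print([channel for channel, _ in ranked[:3]])  # [24, 21, 25]
```

A quiet channel in one scan is only a snapshot, so it's worth repeating the scan at different times of day before drawing conclusions.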

But the problem is almost certainly too few routers. How many you need will depend entirely on the structure and layout of your home, but I would expect it to work out at a dozen or more.

Incidentally, you should expect all your end devices to be unavailable sometimes - after a restart, for example. The coordinator has to wait for them to check in, which can take an hour or more with some devices.

Do read the community guides - there’s lots of sensible stuff there. Zigbee is not a simple point to point thing - it can take a lot of tuning to get it right.

Thanks for the suggestion about energy_scan. Mine looks way different from yours.

    "energy_scan": {
      "11": 0.3649532476334485,
      "12": 0.6967547825628676,
      "13": 0.792717332355823,
      "14": 0.5380922496244791,
      "15": 0.41540864658928767,
      "16": 0.5380922496244791,
      "17": 0.6123372955913717,
      "18": 0.6967547825628676,
      "19": 0.792717332355823,
      "20": 0.2816331001848671,
      "21": 0.3649532476334485,
      "22": 0.3649532476334485,
      "23": 0.2816331001848671,
      "24": 0.24738567181200594,
      "25": 0.2816331001848671,
      "26": 0.24738567181200594
    },

I did two scans a couple hours apart, and they are similar. I’m not sure what conclusion to draw from my numbers.

You are both suggesting that I need on the order of a router or more per battery device. That’s different from what I was naively expecting. If it’s really true, I’d be more likely to just punt these Zigbee sensors and go back to the wifi sensors I was using previously.

Thanks again for your input.

0 on every channel? I doubt you’d get that unless you’re living in a concrete bunker without any WiFi.

Double check the coordinator antenna for hardware issues. At the very least you should be getting a higher number on the channel your ZigBee is on.
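As a quick sanity check, you could flag a scan where no channel rises above some small value. This is just a sketch: the 5% threshold is a number I picked for illustration, not anything documented, and `scan_looks_flat` is a hypothetical helper. The values are the ones from the scan posted above.

```python
from typing import Dict


def scan_looks_flat(energy_scan: Dict[str, float], threshold: float = 5.0) -> bool:
    """True if every channel reads below the (assumed) threshold percentage."""
    return max(energy_scan.values()) < threshold


# Values copied from the scan posted above.
reported_scan = {
    "11": 0.3649532476334485, "12": 0.6967547825628676,
    "13": 0.792717332355823, "14": 0.5380922496244791,
    "15": 0.41540864658928767, "16": 0.5380922496244791,
    "17": 0.6123372955913717, "18": 0.6967547825628676,
    "19": 0.792717332355823, "20": 0.2816331001848671,
    "21": 0.3649532476334485, "22": 0.3649532476334485,
    "23": 0.2816331001848671, "24": 0.24738567181200594,
    "25": 0.2816331001848671, "26": 0.24738567181200594,
}

print(scan_looks_flat(reported_scan))  # True
```

A flat result doesn't say whether the cause is hardware or firmware; it only confirms that the radio is reporting almost nothing on any channel.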

Edit: just spotted you’re using multiprotocol, which is no longer recommended due to instability issues like this. You might need to flash Zigbee-only firmware back onto your dongle.

Looks like we were wrong… :face_exhaling:

Hey I took a 1% chance that I might be wrong! :smiley:

Even if this is caused by a faulty stick I still stick with the recommendation of more routers though :slight_smile:

Yeah you should not be using the Multi-PAN RCP / multiprotocol firmware. Disable multiprotocol and flash the EmberZNet NCP firmware instead. Buy a separate radio dongle for the Thread protocol.

That is not so. It does not sound like you read and understood it, as it also covers best practices and actions that everyone should take regardless of setup. And if you now only have two Zigbee Router devices then you are not following those best practice tips, because Zigbee relies heavily on Zigbee Routers and mesh networking (which battery devices cannot do on their own). It also mentions that multiprotocol firmware is not recommended and that you should use a dedicated radio dongle with NCP firmware for Zigbee. See: Zigbee networks: how to guide for avoiding interference + optimize using Zigbee Router devices (repeaters/extenders) to get a stable mesh network with best possible range and coverage

That energy scan is 100% weird. This is mine:

    "energy_scan": {
      "11": 1.0256846852618655,
      "12": 2.509919386096536,
      "13": 2.2107128772756957,
      "14": 2.84844209578687,
      "15": 82.35373987514762,
      "16": 1.0256846852618655,
      "17": 2.84844209578687,
      "18": 6.789392891308996,
      "19": 85.82097888710312,
      "20": 75.96022321405563,
      "21": 82.35373987514762,
      "22": 75.96022321405563,
      "23": 85.82097888710312,
      "24": 84.164247274957,
      "25": 96.64469941013013,
      "26": 68.14622793558128
    }

You can clearly see my Zigbee network on channel 15, and my 2.4 GHz Wi-Fi on the higher channels.
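To relate Zigbee channels to Wi-Fi channels, you can compare centre frequencies: Zigbee channel k (11 to 26) sits at 2405 + 5(k - 11) MHz, and 2.4 GHz Wi-Fi channel n (1 to 13) at 2412 + 5(n - 1) MHz. The sketch below lists the Wi-Fi channels that would overlap a given Zigbee channel. The channel widths (20 MHz for Wi-Fi, 2 MHz for Zigbee) are simplifying assumptions, and the function names are made up for illustration.

```python
from typing import List

WIFI_WIDTH_MHZ = 20   # assumption: plain 20 MHz Wi-Fi channels
ZIGBEE_WIDTH_MHZ = 2  # assumption: nominal Zigbee channel width


def zigbee_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz Zigbee channel (11-26)."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4 GHz channels are 11-26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz Wi-Fi channel (1-13)."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers Wi-Fi channels 1-13")
    return 2412 + 5 * (channel - 1)


def overlapping_wifi_channels(zigbee_channel: int) -> List[int]:
    """Wi-Fi channels whose band overlaps the given Zigbee channel."""
    limit = (WIFI_WIDTH_MHZ + ZIGBEE_WIDTH_MHZ) / 2
    center = zigbee_center_mhz(zigbee_channel)
    return [w for w in range(1, 14) if abs(wifi_center_mhz(w) - center) < limit]


print(overlapping_wifi_channels(15))  # [2, 3, 4, 5]
```

Real Wi-Fi signals spill a bit beyond their nominal width, so treat the result as a rough guide rather than a hard boundary.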

Thanks for all of the recent input. It definitely gives me some things to try. I don’t want to sound defensive, but it seems like some of the advice about interference and distance and such didn’t pay attention to the fact that it’s all of the devices coming and going at the same time. It’s not just some subset with marginal connections. So, maybe you didn’t catch that detail, or maybe you were thinking that it could happen that way with a non-optimal network.

0 on every channel? I doubt you’d get that unless you’re living in a concrete bunker without any WiFi. Double check the coordinator antenna for hardware issues. At the very least you should be getting a higher number on the channel your ZigBee is on.

All I can say is that I’ve dumped those diagnostics multiple times, and they are always in that ballpark. My Zigbees are using channel 11. The SkyConnect dongle does not have an externally visible antenna.

Yeah you should not be using the Multi-PAN RCP / multiprotocol firmware. Disable multiprotocol and flash the EmberZNet NCP firmware instead.

Fair enough. I will give those a try. I currently only have Zigbee devices, no Thread devices. I’ve avoided stuff like this so far because I’m not sure what will trigger the need to re-pair all my devices. A couple of them are in inconvenient locations. The advice about the firmware seems solid enough that it’s worth it even if I have to re-pair.

if you now only have two Zigbee Router devices then you are not following those best practice tips

First, let me say that’s a great guide, and I appreciate the effort that went into it. I’ve just re-read it specifically to see what it said about routers. The graphic shows a lot of routers, and in one place the text recommends a “swarm” of routers. On the other hand, it also says “Personally, I suggest buying and adding at least three such devices.” Well, 2 is not so very different from 3 :slight_smile:.

If you really need about as many router devices as battery devices (leaving aside the need due to distance or walls or interference), then maybe Zigbee is not for me. All of my “real” devices (mostly contact and motion sensors) are battery powered.

That energy scan is 100% weird

I wonder if SkyConnect isn’t the best coordinator to use. I got it because I figured the HA people behind it would have figured out all the best tricks and so on. Or maybe part of the firmware situation also leads to these strange energy scan results. Beats me.

Thanks again to everyone for the input. I’m off to update my firmware, etc, and will report back with the outcome.

Ah, missed the Skyconnect part. I’d still suspect a dodgy antenna, but first I would flash it with zigbee-only firmware. That’s the only way to narrow down whether it’s a software issue or a hardware issue.

I’m back to the Zigbee firmware on the SkyConnect. Now the wait to see what happens (while hoping it doesn’t happen … how do I prove a negative again? :slight_smile:)

BTW, I went down some wrong pathways for the firmware flashing before I finally discovered the very convenient built-in option described here: Home Assistant Connect ZBT-1 (Disable multiprotocol support)

Good luck! Out of curiosity, now that you’re on zigbee-only firmware, what does the energy scan look like?