All Zigbee devices suddenly become offline to HA at once

So there I was sitting comfortably. Perhaps too comfortably. Smug even. Thought I’d thought everything through. Gone from 0 to 50 devices in a year in our new flat, all performing pretty reliably.

Wasn’t sure what category to post this in. It might be my config. It might be hardware related. It might be the OS, so I plumped for “Configuration” as a catch-all. Because for sure something spiked my Zigbee network about 20-22 hours ago. Just like that. I don’t know the exact time, unfortunately, because my automations only happen noticeably at night. We only realised when sunset must have been a while ago and the lights hadn’t come on. That was about 10pm, but I thought it was just one of those glitches that happen from time to time. I took a look later and tried the usual reboot, but then found that every device had gone offline.

I posted some detail here last night showing some of the error messages, but I have not become clever enough yet to know how to obtain and download all the relevant detailed logs to try to find the event responsible. I don’t think I got to them in time, although this seems to scream “culprit”:
Your network is using the insecure Zigbee2MQTT network key!
which is odd because, although it’s installed, I have not yet had time to edumacate myself in the use of Z2M. It’s not actually configured.

In addition to the error messages posted there, here is what might be a Z2M significant event:

Zigbee2MQTT:error 2023-07-01 22:02:47: Error: Failed to connect to the adapter (Error: SRSP - SYS - ping after 6000ms)
    at ZStackAdapter.start (/app/node_modules/zigbee-herdsman/src/adapter/z-stack/adapter/zStackAdapter.ts:103:27)
    at Controller.start (/app/node_modules/zigbee-herdsman/src/controller/controller.ts:132:29)
    at Zigbee.start (/app/lib/zigbee.ts:58:27)
    at Controller.start (/app/lib/controller.ts:101:27)
    at start (/app/index.js:107:5)

But I’m not sure how to interpret it.
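For anyone else staring at a line like the one above: it reads as Z2M starting up, pinging the adapter over the serial port and getting no answer within 6000 ms. The indented "at ..." lines are just the stack trace. A minimal Python sketch for picking such entries out of a saved log follows; the regex is an assumption based only on the single line quoted in this post, and the function names are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed line shape, taken from the one example quoted in this thread:
#   Zigbee2MQTT:<level> <YYYY-MM-DD HH:MM:SS>: <message>
LINE_RE = re.compile(
    r"^Zigbee2MQTT:(?P<level>\w+)\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): "
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    level: str
    timestamp: str
    message: str


def parse_z2m_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry for a Z2M log line, or None for anything else
    (for example the indented stack-trace lines)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return LogEntry(match["level"], match["timestamp"], match["message"])


def is_adapter_connect_failure(entry: LogEntry) -> bool:
    """True when the entry is the 'could not reach the adapter' error."""
    return entry.level == "error" and "Failed to connect to the adapter" in entry.message
```

Feeding it the quoted line gives the level, the time (22:02:47) and the message, which at least pins down when Z2M was trying to talk to the stick.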

This is from Log Viewer:

Add-on version: 0.15.1
 You are running the latest version of this add-on.
 System: Home Assistant OS 10.3  (aarch64 / raspberrypi4-64)
 Home Assistant Core: 2023.6.3
 Home Assistant Supervisor: 2023.06.4
-----------------------------------------------------------
 Please, share the above information when looking for help
 or support in, e.g., GitHub, forums or the Discord chat.
-----------------------------------------------------------
s6-rc: info: service base-addon-banner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service base-addon-log-level: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service base-addon-log-level successfully started
s6-rc: info: service legacy-cont-init: starting
cont-init: info: running /etc/cont-init.d/nginx.sh
cont-init: info: /etc/cont-init.d/nginx.sh exited 0
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
services-up: info: copying legacy longrun logviewer (no readiness notification)
services-up: info: copying legacy longrun nginx (no readiness notification)
s6-rc: info: service legacy-services successfully started
[22:02:15] INFO: Starting Log Viewer...
2023-07-01T21:02:19.313Z logview:debug start tailing /config/home-assistant.log
2023-07-01T21:02:19.328Z logview:info listening on port 4277 (HTTP)
[22:02:19] INFO: Starting NGINX...
2023-07-01T23:11:09.089Z logview:error 'change' event for /config/home-assistant.log. Error: ENOENT: no such file or directory, stat '/config/home-assistant.log'
2023-07-01T23:11:10.092Z logview:error watch for /config/home-assistant.log failed: Error: ENOENT: no such file or directory, stat '/config/home-assistant.log'

When I started this smart-home project a year ago, I wanted light and blind automation and some motion sensors. I had some prior experience having played with and enjoyed a SmartThings starter kit a few years back. My research led me to a focus on Zigbee, although I had a few wifi sockets already and had reliable experience with e-family cloud - Smart Life now I believe - at least that’s how I have them integrated into HA as the brain.

The HA brain is installed on a Raspberry Pi 4B equipped with plenty of storage and a Sonoff Zigbee 3.0 USB Dongle Plus. There’s a Sonoff Bridge to handle distance. The remaining devices are mainly IKEA Tradfri spots and sensors, Linkind GU10 spots, Sonoff zbmini and Candeo dimmers; more recent additions have been a ZY-M100 and a Somfy Connectivity kit via Overkiz. Curiously, the Somfy devices stayed connected and functioning throughout. Aside from Overkiz and (unused) MQTT integrations there’s Sonos, Tuya and ZHA (showing debug logging enabled, I notice).

So nothing much exotic going on I would say and no great churn over the last couple of months, other than the addition of the Somfy Connectivity kit about a week before this event, and removal of a Tradfri repeater that seemed overkill alongside the Sonoff kit. A week when I upgraded to 10.6.3, and I may have noticed the odd wobble but nothing that I would consider a precursor to an event on this scale. Like the England cricket team I am left stumped. It has seriously caused me to question my tech choice.

It has taken about six hours of rebuild to get about 80% of everything back. Worst thing is, I am the only person with the knowledge to put it all back together, what with some devices working only after power cycling one lighting and one ring circuit at the consumer board. Twice, when you then find that some devices aren’t going to come back on board until you remove them from HA first. Other devices need you to remember whether they need buttons holding in, pressed for five or ten seconds, pressed two or four times, with or without holding the last one or, again, removing from HA. Still haven’t magicked the SmartThings plug back. The serious one is a zbmini wired into a hefty pendant lamp that’s going to need a couple of ladders and people to get to the switch to reset it. That’s the one I really hate myself for configuring, but it was the only way to split a large lighting circuit without demolition.

So I am left with a trust issue. None of my research before or since suggested that such an event was possible. It “feels” like something happened to compromise the Zigbee network itself. Key change? Might Somfy be involved somehow? Zigbee2MQTT? Any insights into what might cause such a collapse, or other suggestions on how to rebuild trust much appreciated, otherwise I will start backing out and maybe introduce more PIR until something more reliable emerges in the marketplace.

Thanks for listening. You can wake up now!

Ok stop and breathe.

The initial error was your HA instance having trouble talking to the Zigbee adapter in your machine. That’s all. It burped - it happens.

Second edit: I’m pretty sure you’re already too far down the path to recover prior work. But next time you probably don’t have to do all this. It looks like a comms hiccup. Why, unknown. But that’s all.

Thanks Nathan. Would be great to know how it might be possible to recover to an earlier point with regard to Zigbee. I tried restoring back to two previous versions but it didn’t recreate the Zigbee network. All still offline. Something destroyed the Zigbee network. I don’t like the sound of “common”. It’s the first time for me. Don’t think I can live with a repeat sword hanging over me. It’s the others in my home who will be left high and dry. Any such solution must be easy to recover from. And this isn’t. If it was common I think HA would not have such adoption.

Don’t jump to conclusions.

What’s common: losing connectivity to your Zigbee coordinator dongle for various reasons. I think where you diverted was at this point. I saw where you posted that you’d rebooted and it didn’t fix it, but did you completely power everything off? This has even happened to me. But in most cases, unless the stick has electronically failed, all it requires is a full power off. (Note: power off, not just a reboot. I’ve seen too many cases where a reboot alone doesn’t do it, so when this happens I shut it all down, count to 60, have coffee and cuss a bit, and turn it back on.)

Once you started down the restore path I can’t tell you what happened because there’s so many variables. But many of the restore paths can be destructive to your zigbee network if you start messing with things.

You mentioned something about key change (that’s a destructive event - when did that happen?)

Thanks for taking the time to read, Nathan. Yes, I completely powered everything off and counted to maybe 20/30 as is my usual. All still offline. I have read that you can lose connection to the coordinator at various points (an upgrade might be an obvious one) and that’s why the mesh is supposedly so successful, but clearly not bulletproof. I have read many and various posts about the mesh carrying on (I have two repeaters) after losing contact with the coordinator. All of which is why this has spooked me. It shouldn’t be possible for the entire mesh of 50 devices, including the two routers (one Sonoff and one IKEA), to simply all go offline at once.

Unless - as you pick up on - the entire network has been switched in some way. Which is why this error stuck out: “Your network is using the insecure Zigbee2MQTT network key!” I also saw this post, which seems to indicate that there is a network key held somewhere and that if you change it you need to re-pair all devices. But what can cause that to happen, randomly, while I’m watching the cricket and days away from the last change to the system - unless it was when I used the Somfy/Overkiz integration to lower the blinds? And what is “the insecure Zigbee2MQTT network key!”? I don’t even use Z2M. I added it a while back in order to check it out, but it seemed a tad advanced for me tbh.

Like I say, kicking myself that I don’t know how to marshal the logs into a safe place so I get a bit more than a day’s review, or know where to look for e.g. Overkiz malicious actions. Or any key change. Seems like an important thing to protect.
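On getting logs into a safe place: one low-tech approach is to copy the log to a timestamped file before rebooting or restoring anything. A rough Python sketch is below. It assumes the log lives at /config/home-assistant.log (the path the Log Viewer output earlier in the thread is tailing); the function name and the archive folder are made up for illustration.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def archive_log(src: Path, dest_dir: Path, now: Optional[datetime] = None) -> Path:
    """Copy src into dest_dir under a timestamped name and return the new path.

    The original file is left untouched, so this is safe to run repeatedly.
    """
    now = now or datetime.now()
    dest_dir.mkdir(parents=True, exist_ok=True)
    # home-assistant.log -> home-assistant-20230701-220247.log
    dest = dest_dir / f"{src.stem}-{now:%Y%m%d-%H%M%S}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves the modification time
    return dest
```

Usage would be something like `archive_log(Path("/config/home-assistant.log"), Path("/config/log_archive"))`, where `/config/log_archive` is a hypothetical folder. Run before any restart, it keeps the evidence from being overwritten.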

Maybe there’s an instruction somewhere that can allow you to lock in your network key so it can never be changed without explicit action/acceptance, although given how catastrophic the result is, I don’t know why that wouldn’t be baked in?

That’s even if it is the cause. The restore path was the simple one in the HA interface that allows you to select the last version and restore it, but it certainly doesn’t restore the network key.

Thanks again, will do some more key reading
g
Edit to say perhaps I should remove any Z2M add-on, integration or reference to it - then I will know I am definitely not using it - although to be clear, I don’t know how to use it.

This is also particularly common with the Sonoff sticks. There are at least two known bugs:

  1. QC issues with the NVRAM where it’ll corrupt itself and stop working. With Z2M simply restarting clears this since Z2M restores a backup to the stick every time it starts.
  2. Some other issue where the stick crashes when the mesh is “too busy”. Simply restarting Z2M clears this issue.

Hi Tinkerer. So you think I should just dive into Z2M and all will be fine? I have teetered on the brink and see a lot of people happy doing it, but I’m worried that it’s an extra level of complexity for e.g. those that come along after me. Not wishing myself dead or anything, but I promised the boss this would be a self-healing experience when I’m not here. Just switch it off and on. Perhaps self-delusionary…
I was wondering whether to factory reset the Sonoff in the hope that it then acquired the remaining “offline” devices - particularly those buried in the ceiling/behind the wall.
Thanks for responding!

More that the Sonoff stick is known to have problems, though the common ones are handled by Z2M. A different coordinator may give you fewer problems.

Anyway, you said you’re using Z2M already.

Pulling the power is rarely a good thing to do to a running computer.

Either way, when it happens you should do an actual investigation into the problem rather than diving for a restore.

Thanks Tinkerer, and this is part of that investigation. I will try to investigate with Sonoff whether this is a known problem, but for sure it hasn’t jumped out at me in all the hours I’ve spent scavenging for answers on the web. I agree that hard stopping any in-flight computer carries risk, but that definitely wasn’t anywhere near the cause. As per above, I exhausted all analytical options available to my tiny brain before the last course that hadn’t been tried, which was to switch it off. I haven’t been able to find a shut-down option in my HA instance, so sadly it had to be the big red switch.

Anyway, it didn’t work and I was wondering whether anyone in here might know if factory resetting a coordinator might restore the original network key, such that my two hard-to-reach devices would become online again.

It won’t.

The network key is defined by the software, not the stick.

If you’re not using z2m, remove it. Even if you haven’t configured it, it still has a default config.

I suspect at some point during a restart z2m was able to grab and reconfigure the stick, hosing the ZHA net.
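If you can get at the network key your coordinator is actually using (from a network backup, say), you can check it against the Z2M default yourself. A small Python sketch follows, with loud caveats: the key value below is the well-known Z2M default as best I recall it from the Z2M documentation, so verify it there before relying on this, and the function names are just for illustration.

```python
# ASSUMPTION: the widely published Zigbee2MQTT default network key, as recalled
# from the Z2M documentation. Verify against the docs before trusting it.
Z2M_DEFAULT_KEY = bytes([1, 3, 5, 7, 9, 11, 13, 15, 0, 2, 4, 6, 8, 10, 12, 13])


def parse_key(key) -> bytes:
    """Accept a key as a list of 16 ints or as a hex string like '01:03:...'."""
    if isinstance(key, str):
        parts = key.replace("-", ":").split(":")
        data = bytes(int(part, 16) for part in parts)
    else:
        data = bytes(key)
    if len(data) != 16:
        raise ValueError(f"a Zigbee network key is 16 bytes, got {len(data)}")
    return data


def uses_default_z2m_key(key) -> bool:
    """True when the given key equals the assumed Z2M default key."""
    return parse_key(key) == Z2M_DEFAULT_KEY
```

If that returns True for the key on your stick, it would fit the theory that Z2M's default config got written to the coordinator at some point.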

Thanks Jerrm. I took a look at the config earlier and armed myself with some reading around Z2M, before deciding to do as you suggested and remove all of the components installed over the past year, including Mosquitto, zigbee2mqtt addons and MQTT integration. Wouldn’t you believe it, all my devices went offline again. I imagine now I might have done the same last time. In some way Z2M is implicated.

Found this thread on GitHub where the same bug has affected someone else, but I cannot contribute as the thread has been locked except to collaborators. The poster here says it continues even after a coordinator firmware refresh. Now I am at a crossroads, either:

1, reinstall, configure and try to use Z2M, in the hope that I can create a secure network key from scratch and somehow make contact with my offline zbmini without a lot of time and expense taking off the ceiling lamp, or

2, give up on acquiring sufficient Z2M knowledge, shut the lid on this jumanji magick, and carry on using HA - which depends on my finding out how to remove the rogue network key - and hope stability continues as it has up until now.

Will find out tomorrow if that “insecure” message reappears in the log - as it does about twice a day.

I had this exact same issue over this past weekend. My setup is initially to drive heating, with the radiators and boiler control driven by separate temperature sensors, and a handful of light switches that I’ve added since. I have a mixture of wifi and zigbee devices, and this weekend, all ZB devices went offline. After a reboot, there was still nothing. As a last resort before calling for help, I tried a cold start (unplugged the pi for 30 seconds and plugged it in again).

After HA had restarted, I could access most devices. I could see that some devices on the mesh had clearly glitched out previously, because the network was horribly unbalanced, but it has since righted itself, as Zigbee is designed to do.

But this did require a full power cycle, which is not something I could do remotely, and the rest of the household has barely enough technical knowledge between them to turn on a light switch.

Question: Is it possible to power cycle the USB port from within HA from an automation, which can do a periodic health check on the ZB devices and power cycle the ZB stick if it fails? My configuration is identical - R Pi 4B with Sonoff Zigbee Plus 3.0 stick.
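The health-check half of that question can at least be sketched. Below is a minimal Python version of the decision logic only: given the current state of each Zigbee entity (HA reports offline entities as "unavailable"), decide whether enough of them are down to justify a reset. The threshold, the minimum device count and the function names are assumptions for illustration. How you would actually cut power to the USB port depends on the hardware and is not shown.

```python
from __future__ import annotations


def unavailable_fraction(states: dict[str, str]) -> float:
    """Fraction of entities whose state is 'unavailable' (0.0 if none given)."""
    if not states:
        return 0.0
    down = sum(1 for state in states.values() if state == "unavailable")
    return down / len(states)


def should_power_cycle(
    states: dict[str, str],
    threshold: float = 0.8,   # assumed: treat 80% or more offline as a coordinator problem
    min_devices: int = 5,     # assumed: avoid acting on a tiny sample
) -> bool:
    """True when so many entities are offline that the stick itself is suspect.

    A single dead bulb should not trigger a reset; most of the mesh vanishing
    at once, as described in this thread, should.
    """
    if len(states) < min_devices:
        return False
    return unavailable_fraction(states) >= threshold
```

The idea behind the threshold is that one or two devices dropping off is normal mesh behaviour, while nearly everything going at once points at the coordinator rather than the devices.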

Did you try just unplugging the coordinator? That is also like a power cycle for the USB device.
This is what you can show the rest of the household to do in such a situation.