Monthly complete breakdown of Zigbee infrastructure

Since September, my Zigbee infrastructure has broken down completely on a fairly regular schedule, roughly every 30 days.
All devices are shown as offline and don’t respond to any Zigbee commands.
I’m fairly good at analyzing logs and have been using HomeAssistant for almost three years, but for this problem I can’t find any hint as to why it happens.
The only thing that helps is restarting the whole (physical) server, so I suspect a hardware, USB, or serial issue.
How can I investigate further?

Setup:
Proxmox VM environment on an Intel NUC i3
Linux Host OS
Sonoff Zigbee TI CC2652P + CP2102N

HomeAssistant VM:

  • Core 2024.11.2
  • Supervisor 2024.11.4
  • Operating System 13.2
  • Frontend 20241106.2
  • Z2M 1.41.0-1

Have you looked through the Z2M logs?

Yes, but nothing in them helps me narrow down the problem. All devices show either

Data request failed with error: 'undefined' (25)

or

Data request failed with error 'SRSP - AF - dataRequest after 6000ms'

Here is the complete log file
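
To see which of the two errors dominates, the log can be tallied with a short script. This is only a minimal sketch, assuming the log lines contain the two message shapes quoted above; `tally_data_request_errors` is a name made up for this illustration, not part of Z2M.

```python
import re
from collections import Counter

# Matches both shapes quoted above:
#   Data request failed with error: 'undefined' (25)
#   Data request failed with error 'SRSP - AF - dataRequest after 6000ms'
ERROR_RE = re.compile(r"Data request failed with error:? '([^']*)'(?: \((\d+)\))?")


def tally_data_request_errors(lines):
    """Count how often each 'Data request failed' reason appears."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match is None:
            continue  # not a data-request error line
        reason, code = match.group(1), match.group(2)
        # Keep the numeric code (if any) as part of the key.
        key = f"{reason} ({code})" if code else reason
        counts[key] += 1
    return counts
```

Passing an opened log file to the function returns a `Counter`; its `most_common()` method lists the reasons from most to least frequent.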

Looking through the log file, it looks to me like a faulty coordinator that is failing to reach your end devices/routers.

Do you use a USB extension cable between your HA server and the coordinator? If not, I would start by adding one and placing the coordinator as far away from your HA server as possible, to rule out interference problems.

What coordinator is used?

It has a 3 m extension cable. The coordinator is a Sonoff device; see my original post.

What firmware version is your dongle running on?

I had coordinator disconnect problems when I upgraded to OS 13.x. I downgraded to 12.4, which fixed my problem. Rebooting provided a temporary fix.

At the time, I read on the forum that there was an issue with the USB drivers in 13.x.
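
One way to check whether the dongle is dropping off the USB bus is to filter the kernel log for USB and serial-adapter messages around the time of a breakdown. A minimal sketch, assuming the kernel log has been saved to a text file (for example from `dmesg` or `journalctl -k`); the keyword list and the name `filter_usb_lines` are assumptions for illustration. `cp210x` is the Linux driver for the dongle’s CP2102N USB-to-serial chip.

```python
import re

# Keywords that usually matter for a USB serial dongle. The list is an
# assumption for this sketch and can be adjusted.
USB_PATTERN = re.compile(r"usb|cp210x|ttyUSB|ttyACM", re.IGNORECASE)


def filter_usb_lines(lines):
    """Return only the log lines that mention USB or the serial adapter."""
    return [line.rstrip("\n") for line in lines if USB_PATTERN.search(line)]
```

The function accepts any iterable of lines, so an opened file can be passed directly. Disconnect or re-attach messages that line up with the time the Zigbee network went down would point at the USB side rather than at Z2M.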

Coordinator type: zStack3x0
Coordinator version: 20210708

I’d defo try to update your coordinator first. 2021 was quite some time ago…
The current version is 20240710.
The update is quite a simple process.
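
To put a number on how old the firmware is: the two version strings read like build dates, so they can be compared directly. A small sketch, assuming the `YYYYMMDD` reading is correct; `firmware_date` is a helper name made up here.

```python
from datetime import date, datetime


def firmware_date(version: str) -> date:
    """Interpret a coordinator version such as '20210708' as a build date."""
    return datetime.strptime(version, "%Y%m%d").date()


installed = firmware_date("20210708")  # version reported by the coordinator
latest = firmware_date("20240710")     # current version mentioned above
age_days = (latest - installed).days

print(f"Installed firmware is {age_days} days older than the latest build")
```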

You can download the latest FW HERE

Texas Instruments Flasher HERE

This flasher HERE works too

You can follow any of THESE YouTube tutorials

Will try that, thanks! You’re completely right, it’s quite old.

No prob. Of course, I’m not saying it’s a 100% fix for your issue, but it is definitely a starting point…

Don’t worry, you won’t need to re-pair your Zigbee devices; they’ll still be there after the upgrade.

One more pro tip: most, if not all, of those videos will tell you that to enter flash mode you need to press and hold the boot button while plugging in your dongle, and keep holding it for about 10 seconds. When I tried that, it did NOT work at all. What did the trick for me was to plug the dongle in and immediately start pressing (not holding) the boot button repeatedly for about 10-15 seconds. This can save you some headaches :wink:

Haha thanks :slight_smile:
Actually, it worked by holding the button while plugging it in. First I had to find it; I wasn’t aware you can open the case :smiley:
I’m now running successfully on 20240710, so let’s see if this works better. If not, I will try @beaj’s solution of downgrading HAOS.

lol, of course it’s just my dongle that must have been special… I was losing my mind when I was doing it, plugging and unplugging it like an idiot for an hour :smiley:

If you don’t mind, post an update here in a few weeks. You might help someone else…

Just started having similar issues a week or so ago… I think it started after 2024.11.2?

All ‘powered’ devices go offline, but battery-powered ones, like motion sensors, remain connected. Rebooting the NUC itself fixes the issue for a few days, then it starts bugging out again, usually during the night, so it’s pitch black in the morning :stuck_out_tongue:

So far I have had no issues since the coordinator update. Did you try it as well, @MistaWu?
However, it hasn’t been long, so issues might still come back. Will keep this thread updated!

Not yet, I did a restore back to 2024.11.1 today just to try… But I suppose a firmware update of the Sonoff couldn’t hurt, I just remember it being a little tedious.

I don’t think HA Core is the problem here. Maybe try a downgrade of HA OS (the underlying OS of HomeAssistant), like @beaj did.

Just to expand on things a bit.

I run 3 instances of Z2M, each on its own RPi3B using HAOS. I have a lot of Zigbee devices, and I have found that once I go over ~90 devices on a Zigbee network, things become more unstable. I also have Zigbee devices that don’t play well together or point-blank refuse to connect to another Zigbee network.

Around August(ish) I upgraded to the latest HA release that included OS 13.x, and all my Hue lights (which are on a single Z2M instance) disconnected because the coordinator was not seen on the USB port. Reboots, power cycles, etc. did not help. A bit of digging on the forums suggested there was an issue with the USB drivers in 13.x, so I downgraded to 12.6 (or 12.4, can’t remember), which seemed to solve the issue.

Since then I have upgraded that machine to Core 11.3 and OS 13.2 and it looks to be stable, so the USB driver issue may have been resolved, though I’m not sure.

Edit:

Scrub that, I was looking at another machine. It is running OS 12.2 and Core 12.5.3. I treat it as an appliance so don’t routinely update it.

~B

yay… flashed the firmware of the Sonoff stick… now Z2M is stuck on this and won’t start:

[2024-11-29 09:48:45] error: 	z2m: Error while starting zigbee-herdsman
[2024-11-29 09:48:45] error: 	z2m: Failed to start zigbee
[2024-11-29 09:48:45] error: 	z2m: Check https://www.zigbee2mqtt.io/guide/installation/20_zigbee2mqtt-fails-to-start.html for possible solutions
[2024-11-29 09:48:45] error: 	z2m: Exiting...
[2024-11-29 09:48:45] error: 	z2m: Error: network commissioning timed out - most likely network with the same panId or extendedPanId already exists nearby (Error: AREQ - ZDO - stateChangeInd after 60000ms

Any ideas? I tried adding this to the config, but it doesn’t seem to do anything:

advanced:
  pan_id: GENERATE
  ext_pan_id: GENERATE
  network_key: GENERATE