Troubleshooting an unusual/unreliable Zigbee network

I’ve been running a Home Assistant setup for a few years now, but over the last 6 months our Zigbee-based devices have started to behave so erratically that we no longer trust them. We are at a loss as to how to identify the root cause of the problems or how to proceed.

The network was in place for over a year and was functioning well. Then, in the last 5-6 months and for no obvious reason, it started to misbehave to the point that we are getting desperate.

It started with the occasional delay on an action taking place. For example, sending a light.turn_on could take 3 to 15 seconds to happen. Often the whole network will stop responding to instructions, then execute all the queued-up requests in rapid succession (such as turning lights on/off/back on).

The behaviour is somewhat erratic: sometimes the system will work for days or weeks on end without fault, then start misbehaving apparently at random.

This now seems to be escalating to new, stranger behaviours, like some lights dimming in a loop or flashing/blinking apparently at random (e.g. one bulb in a group of five will just start ramping up from 0 brightness repeatedly). Often it is a different bulb in a different room. We do not have any automations or behaviours assigned to single bulbs; everything is attached to a Zigbee group of bulbs in ZHA.

In the last week, we’ve had a small number of devices (an Ikea smart plug and a Philips Hue bulb) fall off the network. They had to be reset and paired again, which is new and troubling behaviour.

What I’ve tried

  • Checked for interference / tried other channels. Using the AP scanning function on my UniFi APs (2 APs in the property, running on channels 1 and 6) to get channel usage shows that most channels (including Wi-Fi channel 8, which overlaps with Zigbee channel 20 at 2450 MHz) have low utilization/noise. This is the only real tool I have to check this.

  • Moving the coordinator to another part of the house, and making sure that it is away from any other electronic devices and the access points

  • Adding an additional router to boost signal on the other side of the house

  • Rebuilding the Zigbee network fully with a factory reset and re-pairing of each device in place

  • Replacing the coordinator, from an Electrolama to an SLZB-06

  • Replacing Zigbee2MQTT with ZHA

  • Rebuilding Home Assistant fully

  • Moving HA to a more powerful host device

We thought for a little while that rebuilding HA had helped with the issues, but it is unclear whether this is the case.

The new host is usually sitting at around 12% CPU usage and about 10% RAM, so it does not seem to be taxed in any way. Looking at the Zigbee network map in ZHA (attached), we can see most of the connecting lines are ‘yellow’ and occasionally green.

(The red device is one of the Ikea Smart-plugs we have not gotten around to re-pairing back to the network)

The network itself is composed of 72 devices, with the following breakdown:
33 Philips Hue bulbs (Mains Powered)
6 Ikea ‘Tradfri’ bulbs (Mains powered)
6 Philips Hue Remotes (Battery)
2 Ikea Tradfri Remotes (Battery)
5 Xiaomi mmWave motion detectors (Mains powered USB)
5 Xiaomi Aqara door sensors (Battery)

And this is spread over 3 floors in a normal-sized UK home (i.e., fairly small).

I’ve been trying to collect debug logs from HA and the SLZB-06 Zigbee controller, even just from today, but honestly I’m not sure exactly what is relevant. Please find attached some snippets from HA and the coordinator that keep cropping up.

HA logs, this error shows up several times a day:

Logger: homeassistant
Source: /usr/src/homeassistant/homeassistant/runner.py:112
First occurred: 03:18:26 (10 occurrences)
Last logged: 13:24:11
Error doing job: Task exception was never retrieved (None)

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1120, in request_callback_rsp
    return await callback_rsp
           ^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1117, in request_callback_rsp
    async with asyncio_timeout(timeout):
               ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/usr/local/lib/python3.13/asyncio/timeouts.py", line 116, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

And often this error as well

2025-05-06 11:01:40.841 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1118, in request_callback_rsp
    await self.request(request, timeout=timeout, **response_params)
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1079, in request
    response = await response_future
               ^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1117, in request_callback_rsp
    async with asyncio_timeout(timeout):
               ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/usr/local/lib/python3.13/asyncio/timeouts.py", line 116, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
2025-05-06 11:01:40.843 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1120, in request_callback_rsp
    return await callback_rsp
           ^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1117, in request_callback_rsp
    async with asyncio_timeout(timeout):
               ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/usr/local/lib/python3.13/asyncio/timeouts.py", line 116, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

We seem to get a lot of these errors in the log on the SLZB-06, but the timestamps don’t seem to line up with the errors in HA, so we don’t know whether they are related or not.

[10:31:04] zb_packet | wrong paket len: 2 expected: 4
[10:41:43] zb_packet | wrong paket len: 2 expected: 4
[10:52:29] zb_packet | wrong paket len: 57 expected: 202
[11:21:13] zbSelfOta | Heap: 2684
[11:29:47] zb_packet | wrong paket len: 5 expected: 70
[11:39:45] zb_packet | wrong paket len: 5 expected: 49
[12:21:13] zbSelfOta | Heap: 2684
[12:23:59] zb_packet | wrong paket len: 2 expected: 4
[12:40:58] zb_packet | wrong paket len: 2 expected: 4
[13:05:23] zb_packet | wrong paket len: 2 expected: 4
[13:05:23] zb_packet | wrong paket len: 13 expected: 14
[13:21:13] zbSelfOta | Heap: 2684
[13:24:45] zb_packet | wrong paket len: 5 expected: 164
[13:39:03] zb_packet | wrong paket len: 2 expected: 4
[13:40:02] zb_packet | wrong paket len: 5 expected: 176
[13:48:00] zb_packet | wrong paket len: 43 expected: 198
[14:21:13] zbSelfOta | Heap: 2684
[14:56:40] zb_packet | wrong paket len: 5 expected: 43
[15:21:13] zbSelfOta | Heap: 2684
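One rough way to test whether the two sets of timestamps line up is to measure, for each HA error, how far away the nearest coordinator entry is. The sketch below is only an illustration: the regular expressions are modelled on the snippets pasted above, the function names are made up for this example, and it assumes the coordinator log (which carries no date) is from the same day and time zone as the HA log.

```python
import re
from datetime import date, datetime, timedelta

# Line shapes copied from the snippets in this thread.
SLZB_LINE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\] zb_packet \| wrong paket len")
HA_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ ERROR")


def parse_slzb(lines, day):
    """Timestamps of 'wrong paket len' entries.

    ASSUMPTION: the coordinator log has no date, so every entry is taken
    to fall on `day`, in the same time zone as the HA log.
    """
    stamps = []
    for line in lines:
        m = SLZB_LINE.match(line.strip())
        if m:
            hour, minute, second = (int(g) for g in m.groups())
            stamps.append(datetime(day.year, day.month, day.day, hour, minute, second))
    return stamps


def parse_ha(lines):
    """Timestamps of ERROR lines in the HA log (traceback lines are skipped)."""
    stamps = []
    for line in lines:
        m = HA_LINE.match(line.strip())
        if m:
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return stamps


def nearest_gaps(ha_times, slzb_times):
    """For each HA error, the absolute gap to the closest coordinator entry."""
    if not slzb_times:
        return []
    return [(t, min(abs(t - s) for s in slzb_times)) for t in ha_times]


def count_within(gaps, window=timedelta(seconds=30)):
    """How many HA errors have a coordinator entry within `window`."""
    return sum(1 for _, gap in gaps if gap <= window)


if __name__ == "__main__":
    slzb = parse_slzb(
        [
            "[10:52:29] zb_packet | wrong paket len: 57 expected: 202",
            "[11:29:47] zb_packet | wrong paket len: 5 expected: 70",
        ],
        date(2025, 5, 6),
    )
    ha = parse_ha(
        ["2025-05-06 11:01:40.841 ERROR (MainThread) [homeassistant] Error doing job"]
    )
    for when, gap in nearest_gaps(ha, slzb):
        print(when, "nearest coordinator entry is", gap, "away")
```

If most HA errors have no coordinator entry within a short window, the two logs are probably reporting different things, although a clock offset between the two devices would produce the same picture.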

I’ll also attach a copy of the debug logs from ZHA, which we managed to capture while the behaviour was happening, but the logs are extensive! I’ve had to upload them to Google Drive as they are too large to share on pastebin etc.

I’m not sure whether this will help. I have a Pi 4 with HA (Home Assistant), and I moved its operating system (OS) onto a larger Kingston drive about two years ago. I use the ConBee II USB coordinator, which I’d had plugged directly into the back of the Pi 4 USB port, right next to the Kingston drive. Configured this way, it initially caused interference with my Zigbee devices: they dropped out or wouldn’t connect at all. I’m using ZHA to connect to my Zigbee devices, not MQTT.

I have 28 devices connected: IKEA TRADFRI bulbs, Aqara sensors and their “Cube” (LOVE IT), and various Amazon Zigbee plugs. My house is about two floors, and upstairs to my bedroom is about 50 ft through walls. My router is downstairs in my studio, in a corner of the house, so lots of my Zigbee devices are rather far from my router. The Pi is on my studio desk about 3 ft from my main router, with the USB ConBee plugged in, now off a reliable 3 ft USB extension cable. That’s to give you an idea of my setup.

Anyway, when I initially plugged the Kingston drive in right next to the ConBee, which was plugged directly into the USB port, all of my devices dropped out. To fix that I tethered the ConBee onto a reliable USB cable away from the Kingston drive, and everything connected immediately. I’ve had no problems since.

Could it be a hardware issue that is giving you difficulties? I’ve found the ConBee II coordinator by Dresden Elektronik to be very reliable, again using ZHA rather than MQTT. By the sound of your difficulties, you may want to consider the hardware you are using. I had similar difficulty initially when I installed the Kingston drive right next to the ConBee II coordinator. Once I moved it away from the Kingston drive (it hangs off the USB cable behind my studio desk), everything connected. Could your problem be a USB cable? Some don’t work well, as they only provide power rather than reliably transfer data. Maybe check that? Could it be the SLZB-06 Zigbee controller? The ConBee II I have works very well.

I’ve included a picture of my Pi connected to the Kingston drive, with a “tethered” USB cable to the coordinator. Hopefully this may give you some insight into your difficulties. Sorry, this is all I can think of in relation to a similar problem I initially had. I realize it’s frustrating, so possibly get just one device working reliably, such as the IKEA TRADFRI bulbs, which work really well in my house. You mentioned in your post that you were using these bulbs. Connect one in a lamp as close as you can to your HA and, once it’s connected, test it by moving the lamp to different locations in your home.

I suspect that it may be your hardware rather than software. Sorry that you are having such difficulties. I don’t know whether you have already tried any of this, but it is the best I can recommend. All the best, and please let us know if you find a solution.

Hi,

At this point I don’t think it’s a hardware issue, at least not with the Zigbee controller, as I’ve been running an Electrolama ZZH (which is based on the recommended CC2652R1 chipset).

But to rule it out I replaced it with an SLZB-06, which runs the CC2652P chipset.

From all my reading, these are considered to be decent chipsets. I’ve also ensured the devices are on extension cables (the SLZB-06 uses Ethernet + PoE).

So it’s a good thought, but I don’t think this is the case, I’m afraid.

The mmwave sensors might be flooding your Zigbee network.

Is there any way I can check to see if that is the case?

I’m not too sure that this is the issue, however, as these issues started long before the mmWave sensors were added to the network (they have only been in place for about two or three weeks), while the issues have been ongoing for months.

Do you allow unifi to auto optimize the wifi?

That zigbee 20 with an active wifi 6 and auto optimization bothers me.

I had to cut channel 11 out of my Wi-Fi entirely and moved my Zigbee to 25. Another pattern that works is putting both on 11 (Zigbee 11 sits on top of Wi-Fi 1, so they don’t interfere).

This sounds very similar to what mine was doing when unifi ‘optimized’ my wifi on top of my zigbee…

The UniFi is set to a manual channel configuration. It’s currently using Wi-Fi channels 1 and 6 for the two APs in the property, and the Zigbee network is on the ‘default’ channel 20.

The ‘scan’ showed that the overlapping channels for the two are quiet (or at least, the UniFi considered them to be). From what I can see, Wi-Fi channels 6-11 are pretty empty around here, so moving the Zigbee network to a higher channel wouldn’t really hurt anything. I’m guessing it would need all the devices to be re-paired to the network, though (in which case, that’s very much a weekend task!) :smiley:
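For anyone comparing the channel numbers mentioned in this thread, the nominal center frequencies can be worked out with the standard 2.4 GHz channel formulas. This is a minimal sketch with made-up function names. It only does arithmetic on nominal figures and says nothing about how noisy a channel actually is.

```python
def zigbee_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz Zigbee (IEEE 802.15.4) channel, 11 to 26."""
    if not 11 <= channel <= 26:
        raise ValueError("2.4 GHz Zigbee channels run from 11 to 26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz Wi-Fi channel, 1 to 13."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch covers Wi-Fi channels 1 to 13 only")
    return 2407 + 5 * channel


def separation_mhz(zigbee_channel: int, wifi_channel: int) -> int:
    """Distance between the two channel centers, in MHz."""
    return abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))


if __name__ == "__main__":
    for zb in (11, 20, 25):
        for wifi in (1, 6, 11):
            print(f"Zigbee {zb} vs Wi-Fi {wifi}: {separation_mhz(zb, wifi)} MHz apart")
```

A 20 MHz Wi-Fi channel nominally reaches about 10 MHz either side of its center, so a small separation means the Zigbee channel sits inside the Wi-Fi channel, while a separation just over 10 MHz puts it near the edge. Real radios leak some energy past the nominal edge, so treat this as a rough guide rather than a measurement.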

Like @francisp, my first thought was possible flooding from the motion sensors. Some of them are very noisy, with multiple updates per second.

Second thought is wondering if all the bulbs are continuously powered. Physically cutting power at a wall switch can cause problems when the bulbs go offline.

I would probably try to eliminate devices one by one and see if you get to a stable point.

Some devices just don’t play well together. Overall, I’ve been blessed with stable nets, but the worst problem I had was when I introduced a Sonoff Plug into the network. All sorts of gremlins crept in. I can’t say it was the sonoff plug specifically, but stability returned when I replaced the sonoff plug with a sengled plug.

I hate to say it was the sonoff’s fault, it might be another device at the root. A lot of folks use sonoff plugs happily, and I have had it on a smaller test net without issue. However, it was obvious that for the specific mix of devices on my primary net there was some sort of incompatibility.

I’m unsure how exactly I would check whether the motion sensors are causing the problems (although, to be clear, this has been going on since long before the motion sensors were added to the network). I’m happy to consider this a cause but… how would I know? Is that controllable or measurable in some way?

The bulbs are always switched on (the actual switches controlling the power to the bulbs are not easy to access, so they always receive power).

I can try removing some devices, but I’m sure you can understand that unplugging lightbulbs isn’t something I can do long-term without getting people very cross with me…

Most of the network is either Hue or Xiaomi devices. I could remove the door sensors (but given they are battery powered, would they be causing problems?).

Presumably the logs/errors I’m seeing in HA are part of the problem and should ‘stop’ (as this would be a more obvious sign of an improvement than waiting for a bulb to freak out or the network to be slow).

If it is a problematic device would simply powering it off be enough or would it have to be removed from the network?
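On the "how would I know?" question, one possible approach is to count how many incoming packets each device accounts for in a captured debug log, then see whether the motion sensors dominate. The sketch below is hedged heavily: the regular expression is a guess at what a "received packet" debug line looks like, not a confirmed log format, so it would need adjusting against the real ZHA log. The function name is made up for this example.

```python
import re
from collections import Counter

# ASSUMPTION: debug lines for incoming traffic contain "Received a packet"
# followed by a source field holding a 16-bit network address such as
# "address=0x1A2B". Adjust this pattern to match the actual log.
PACKET_SRC = re.compile(r"Received a packet.*?src=.*?address=(0x[0-9A-Fa-f]{4})")


def count_packets_by_source(lines, pattern=PACKET_SRC):
    """Count matching log lines per source network address."""
    counts = Counter()
    for line in lines:
        m = pattern.search(line)
        if m:
            counts[m.group(1).lower()] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical lines, for illustration only.
    sample = [
        "DEBUG Received a packet: ZigbeePacket(src=AddrModeAddress(address=0x1A2B), data=...)",
        "DEBUG Received a packet: ZigbeePacket(src=AddrModeAddress(address=0x1A2B), data=...)",
        "DEBUG Received a packet: ZigbeePacket(src=AddrModeAddress(address=0x9C4D), data=...)",
    ]
    for address, n in count_packets_by_source(sample).most_common():
        print(address, n)
```

The addresses at the top of the list can then be matched against the network addresses ZHA shows for each device. A sensor that accounts for a large share of the traffic would be the first candidate to power off for a test.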

UniFi does not scan for Zigbee. It is only looking at Wifi

Damn! I just wasn’t sure from your original post. Of course everyone’s setup is different, but I thought I might have some insight from the difficulties I’ve had with mine. PoE, huh? Jeez, I’ve never powered anything over Ethernet before. Then again, I haven’t needed to. Could that be the difficulty? Sorry, Slychocobo, that I couldn’t give you any better direction. I have the IKEA bulbs also and they paired REALLY well, so those do work great. Thanks for getting back to me. Please let us know how you finally come to resolve the difficulty. Have a great day, pal.

Well, I’ve discovered that HA can give an ‘Energy Scan’ of the Zigbee channels. I’m going to try to collate a few days’ worth of data and see if it provides any more useful information on how busy each channel is.

Hopefully it will help.

Listen to Nathan.

Zigbee channel 20 overlaps with wifi ch6. Change to zigbee ch 25.

See: ZigBee and Wi-Fi Coexistence | MetaGeek

Also set your Wi-Fi APs’ 2.4 GHz radios to Medium transmit power, not High or Auto.

I’m just collating a few days’ worth of data by grabbing the ‘diagnostic’ .json file from ZHA every 30 minutes and poking it into a spreadsheet via a script, to see if that gives a better picture of the network.
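A collation script along those lines might look roughly like the sketch below. It assumes each downloaded diagnostics file holds the per-channel readings under `data` → `energy_scan`, keyed by channel number. That layout is an assumption and should be checked against a real file. The function names are made up for this example.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def energy_scan_from_diagnostics(doc):
    """Per-channel readings from one parsed diagnostics document.

    ASSUMPTION: readings live at doc["data"]["energy_scan"], keyed by
    channel number (as a string in the JSON file).
    """
    scan = doc.get("data", {}).get("energy_scan", {})
    return {int(channel): float(value) for channel, value in scan.items()}


def summarise(docs):
    """Mean, max and sample count per channel across all snapshots."""
    per_channel = defaultdict(list)
    for doc in docs:
        for channel, value in energy_scan_from_diagnostics(doc).items():
            per_channel[channel].append(value)
    return {
        channel: {"mean": mean(values), "max": max(values), "samples": len(values)}
        for channel, values in sorted(per_channel.items())
    }


def load_docs(folder):
    """Parse every .json file in `folder` (one file per snapshot)."""
    return [json.loads(p.read_text()) for p in sorted(Path(folder).glob("*.json"))]


if __name__ == "__main__":
    # Inline snapshots stand in for files here; values are made up.
    snapshots = [
        {"data": {"energy_scan": {"20": 10.0, "25": 2.0}}},
        {"data": {"energy_scan": {"20": 30.0, "25": 4.0}}},
    ]
    for channel, stats in summarise(snapshots).items():
        print(channel, stats)
```

Keeping the maximum alongside the mean helps because a channel that is quiet on average can still have short busy bursts that a single snapshot would miss.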

Given that changing the channel is likely to be several hours of work re-pairing all the devices, I want to make sure this is going to be an effective change.

Would I be correct in assuming that lower numbers in that output are better (less traffic/noise)?

If you’re talking about the output of the energy scan from ZHA, yes.

But honestly, I’ve never found that tool to be much help. It snapshots a point in time, and that varies wildly.

Most devices should take the channel change just fine. Some won’t, but it should be a matter of making the change and then dealing with the exceptions.