Troubleshooting unusual/unreliable Zigbee network

Slychocobo · May 6, 2025, 3:56pm

I’ve been running a Home Assistant setup for a few years now, but over the last 6 months our Zigbee-based devices have started to behave so erratically that we no longer trust them, and we are at a loss as to how to identify the root cause or how to proceed.

The network was in place for over a year and was functioning well; then, in the last 5-6 months, for no obvious reason it started to misbehave to the point that we are getting desperate.

It started with the occasional delay on an action taking place. For example, sending a light.turn_on could take 3 to 15 seconds to happen. Often the whole network will stop responding to instructions, then execute all the queued-up requests in rapid succession (such as turning lights on/off/back on etc.).

The behaviour is somewhat erratic: sometimes the system will work for days or weeks on end without fault, then start misbehaving apparently at random.

This now seems to be escalating to new, stranger behaviours, like some lights dimming in a loop or flashing/blinking apparently at random (e.g. one bulb in a group of five will just start ramping up from 0 brightness repeatedly). Often it is a different bulb in a different room. We do not have any automations or behaviours assigned to single bulbs; everything is attached to a Zigbee group of bulbs in ZHA.

In the last week, we’ve had a small number of devices (an Ikea smart plug and a Philips Hue bulb) fall off the network; they had to be reset and paired again, which is new and troubling behaviour.

What I’ve tried

Checked for interference / tried other channels: I used the AP scanning function on my UniFi APs (2 APs in the property, running on channels 1 and 6) to get channel usage. It shows that most channels (including Zigbee channel 20, 2450 MHz, which overlaps with Wi-Fi channel 8) have low utilization/noise. This is the only real tool I have to check this.

Moving the coordinator to another part of the house, making sure that it is away from any other electronic devices and the access points
Adding an additional router to boost the signal on the other side of the house
Rebuilding the Zigbee network fully with a factory reset and re-pairing of each device in place
Replacing the coordinator, from an Electrolama to an SLZB-06
Replacing Zigbee2MQTT with ZHA
Rebuilding Home Assistant fully
Moving HA to a more powerful host device

We thought for a little while that rebuilding HA had helped with the issues, but it is unclear whether this is the case.

The new host is usually sitting at around 12% CPU usage and about 10% RAM, so it does not seem to be taxed in any way. Looking at the Zigbee network map in ZHA (attached), we can see most of the connecting lines are ‘yellow’ and occasionally green.

(The red device is one of the Ikea Smart-plugs we have not gotten around to re-pairing back to the network)

The network itself is composed of 72 devices, with the following breakdown:
33 Philips Hue bulbs (Mains Powered)
6 Ikea ‘Tradfri’ bulbs (Mains powered)
6 Philips Hue Remotes (Battery)
2 Ikea Tradfri Remotes (Battery)
5 Xiaomi mmWave motion detectors (Mains powered USB)
5 Xiaomi Aqara door sensors (Battery)

And this is spread over 3 floors in a normal-sized UK home (i.e., fairly small).

I’ve been trying to collect debug logs from HA and the SLZB-06 Zigbee coordinator, even from just today, but honestly I’m not sure exactly what is relevant. Please find below some snippets from HA and the coordinator that keep cropping up.

HA logs, this error shows up several times a day:

Logger: homeassistant
Source: /usr/src/homeassistant/homeassistant/runner.py:112
First occurred: 03:18:26 (10 occurrences)
Last logged: 13:24:11
Error doing job: Task exception was never retrieved (None)

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1120, in request_callback_rsp
    return await callback_rsp
           ^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1117, in request_callback_rsp
    async with asyncio_timeout(timeout):
               ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/usr/local/lib/python3.13/asyncio/timeouts.py", line 116, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

And often this error as well:

2025-05-06 11:01:40.841 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1118, in request_callback_rsp
    await self.request(request, timeout=timeout, **response_params)
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1079, in request
    response = await response_future
               ^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1117, in request_callback_rsp
    async with asyncio_timeout(timeout):
               ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/usr/local/lib/python3.13/asyncio/timeouts.py", line 116, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
2025-05-06 11:01:40.843 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1120, in request_callback_rsp
    return await callback_rsp
           ^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1117, in request_callback_rsp
    async with asyncio_timeout(timeout):
               ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/usr/local/lib/python3.13/asyncio/timeouts.py", line 116, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
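For anyone reading these tracebacks: both appear to show the same pattern, a request to the coordinator that received no reply before its timeout expired, so the pending wait was cancelled and reported as a TimeoutError. Below is a minimal sketch of that pattern in plain asyncio. It does not use zigpy_znp itself, and the function names are made up for illustration.

```python
import asyncio


async def wait_for_reply(reply: asyncio.Future, timeout: float) -> str:
    """Hypothetical stand-in for a radio request that waits for a reply."""
    try:
        # If no reply arrives in time, asyncio cancels the inner wait and
        # reports the failure to the caller as a timeout error.
        return await asyncio.wait_for(reply, timeout)
    except asyncio.TimeoutError:
        return "timed out"


async def demo():
    loop = asyncio.get_running_loop()

    silent = loop.create_future()    # never resolved: the "coordinator" stays quiet
    answered = loop.create_future()  # resolved shortly after the demo starts
    loop.call_later(0.01, answered.set_result, "pong")

    return (
        await wait_for_reply(silent, timeout=0.05),
        await wait_for_reply(answered, timeout=1.0),
    )
```

Running `asyncio.run(demo())` gives one timed-out request and one answered request. In the logs above, the interesting question is therefore why the coordinator is sometimes slow or silent, not the Python error itself.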

We seem to get a lot of these errors in the log on the SLZB-06, but the timestamps don’t seem to line up with the errors in HA, so we don’t know whether they are related.

[10:31:04] zb_packet | wrong paket len: 2 expected: 4
[10:41:43] zb_packet | wrong paket len: 2 expected: 4
[10:52:29] zb_packet | wrong paket len: 57 expected: 202
[11:21:13] zbSelfOta | Heap: 2684
[11:29:47] zb_packet | wrong paket len: 5 expected: 70
[11:39:45] zb_packet | wrong paket len: 5 expected: 49
[12:21:13] zbSelfOta | Heap: 2684
[12:23:59] zb_packet | wrong paket len: 2 expected: 4
[12:40:58] zb_packet | wrong paket len: 2 expected: 4
[13:05:23] zb_packet | wrong paket len: 2 expected: 4
[13:05:23] zb_packet | wrong paket len: 13 expected: 14
[13:21:13] zbSelfOta | Heap: 2684
[13:24:45] zb_packet | wrong paket len: 5 expected: 164
[13:39:03] zb_packet | wrong paket len: 2 expected: 4
[13:40:02] zb_packet | wrong paket len: 5 expected: 176
[13:48:00] zb_packet | wrong paket len: 43 expected: 198
[14:21:13] zbSelfOta | Heap: 2684
[14:56:40] zb_packet | wrong paket len: 5 expected: 43
[15:21:13] zbSelfOta | Heap: 2684
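One way to check whether the two logs line up is to parse the coordinator lines and look for a packet error within a short window of each HA error time. The sketch below is only an illustration: the regular expression is written against the lines shown above, the 60-second window is an arbitrary choice, and it assumes both clocks are roughly in sync and that no events straddle midnight.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "[13:24:45] zb_packet | wrong paket len: 5 expected: 164"
LINE_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] zb_packet \| wrong paket len: (\d+) expected: (\d+)"
)


def parse_packet_errors(log_text):
    """Return (time, got, expected) for every 'wrong paket len' line."""
    events = []
    for match in LINE_RE.finditer(log_text):
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        events.append((stamp, int(match.group(2)), int(match.group(3))))
    return events


def nearest_event(events, ha_time, window_seconds=60):
    """Return the event closest to ha_time (HH:MM:SS) within the window, or None."""
    target = datetime.strptime(ha_time, "%H:%M:%S")
    window = timedelta(seconds=window_seconds)
    close = [e for e in events if abs(e[0] - target) <= window]
    return min(close, key=lambda e: abs(e[0] - target)) if close else None


# Three lines copied from the coordinator log above
SAMPLE = """\
[10:52:29] zb_packet | wrong paket len: 57 expected: 202
[11:21:13] zbSelfOta | Heap: 2684
[13:24:45] zb_packet | wrong paket len: 5 expected: 164
"""

events = parse_packet_errors(SAMPLE)
for ha_time in ("11:01:40", "13:24:11"):  # HA error times quoted earlier
    hit = nearest_event(events, ha_time)
    print(ha_time, "->", hit[0].strftime("%H:%M:%S") if hit else "no match")
```

With these sample lines, the 13:24:11 HA error has a packet error 34 seconds later, while the 11:01:40 one has nothing within a minute. A single near-match like this does not prove the two are related.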

I’ll also attach a copy of the debug logs from ZHA, captured while the behaviour was happening, but the logs are extensive! I’ve had to upload them to Google Drive as they are too large to share on Pastebin etc.

Khamrun · May 6, 2025, 5:41pm

I’m not sure whether this will help. I have a Pi 4 running HA (Home Assistant), and I moved its operating system (OS) onto a larger Kingston drive about two years ago. I use the CONBEE II USB coordinator, which I’d had plugged directly into a USB port on the back of the Pi 4, right next to the Kingston drive. Configured this way, it initially caused interference with my Zigbee devices: they dropped out or wouldn’t connect at all. I’m using ZHA to connect to my Zigbee devices, not MQTT.

To give you an idea of my setup: I have 28 devices connected, including TRADFRI IKEA bulbs, Aqara sensors and their “Cube” (LOVE IT), and various Amazon Zigbee plugs. My house has two floors, and upstairs to my bedroom is about 50ft through walls. My router is downstairs in my studio, in a corner of the house, so lots of my Zigbee devices are rather far from my router. The Pi is on my studio desk about 3ft from my main router, with the CONBEE now plugged in via a reliable 3ft USB extension cable.

When I initially plugged the CONBEE directly into the USB port right next to the Kingston drive, all of my devices dropped out. To fix that I tethered the CONBEE on a reliable USB cable away from the Kingston drive, and everything connected immediately. It now hangs off the USB cable behind my studio desk, and I’ve had no problems since.

Could it be a hardware issue that is giving you difficulties? By the sound of your difficulties, you may want to look at the hardware you are using. Could your problem be a USB cable? Some don’t work well, as they only provide power rather than reliably transferring data. Maybe check that? Could it be the SLZB-06 Zigbee coordinator? I’ve found the CONBEE II coordinator by Dresden Elektronik to be very reliable, again using ZHA rather than MQTT.

I’ve included a picture of my Pi connected to the Kingston drive with a “tethered” USB cable to the coordinator. Hopefully this gives you some insight into your difficulties. Sorry, this is all I can think of in relation to a similar problem I initially had. I realize it’s frustrating, so possibly start by getting just one device working reliably, such as the TRADFRI IKEA bulbs, which work really well in my house and which you mentioned using in your post. Connect one in a lamp as close as you can to your HA and, once it’s connected, test it by moving the lamp to different locations in your home. I suspect that it may be your hardware rather than software. Sorry that you are having such difficulties. I don’t know whether you have already tried any of this, but it is the best I can recommend. All the best, and please let us know if you find a solution.

Slychocobo · May 6, 2025, 5:46pm

Hi,

At this point I don’t think it’s a hardware issue, at least not with the Zigbee coordinator, as I’d been running an Electrolama ZZH (which is based on the recommended CC2652R1 chipset).

But to rule it out I replaced it with an SLZB-06, which runs the CC2652P chipset.

From all my reading, these are considered to be decent chipsets. I’ve also ensured the devices are on extension cables (the SLZB-06 uses Ethernet + PoE).

So it’s a good thought, but I don’t think this is the case, I’m afraid.

francisp · May 6, 2025, 6:00pm

The mmwave sensors might be flooding your Zigbee network.

Slychocobo · May 6, 2025, 6:35pm

Is there any way I can check to see if that is the case?

I’m not too sure that this is the issue, however, as these issues started long before the mmWave sensors were added to the network (they have only been in place for about two or three weeks), while the issues have been ongoing for months.

NathanCu · May 6, 2025, 6:44pm

Do you allow unifi to auto optimize the wifi?

That zigbee 20 with an active wifi 6 and auto optimization bothers me.

I had to cut 11 out of my wifi entirely and moved my zigbee to 25. Another pattern that works is all 11 (zigbee 11 is on top of Wi-Fi 1, so they don’t interfere).

This sounds very similar to what mine was doing when unifi ‘optimized’ my wifi on top of my zigbee…

Slychocobo · May 6, 2025, 6:53pm

The UniFi is set to a manual channel configuration. It’s currently using Wi-Fi channels 1 and 6 for the two APs in the property, and the Zigbee network is on the ‘default’ channel 20.

The ‘scan’ showed that the overlapping channels for the two are quiet (or at least, the UniFi considered them to be). From what I can see, Wi-Fi channels 6-11 are pretty empty around here, so moving the Zigbee to a higher channel wouldn’t really hurt anything, but I’m guessing it would need all the devices to be re-paired to the network (in which case, that’s very much a weekend task!).

jerrm · May 6, 2025, 6:53pm

Like @francisp, my first thought was possible flooding from the motion sensors; some of them are very noisy, with multiple updates per second.

Second thought is wondering if all the bulbs are continuously powered. Physically cutting power at a wall switch can cause problems when the bulbs go offline.

I would probably try to eliminate devices one by one and see if you get to a stable point.

Some devices just don’t play well together. Overall, I’ve been blessed with stable nets, but the worst problem I had was when I introduced a Sonoff Plug into the network. All sorts of gremlins crept in. I can’t say it was the sonoff plug specifically, but stability returned when I replaced the sonoff plug with a sengled plug.

I hate to say it was the sonoff’s fault, it might be another device at the root. A lot of folks use sonoff plugs happily, and I have had it on a smaller test net without issue. However, it was obvious that for the specific mix of devices on my primary net there was some sort of incompatibility.

Slychocobo · May 6, 2025, 6:59pm

I’m unsure how exactly I would check whether the motion sensors are causing the problems (although, to be clear, this had been going on long before the motion sensors were added to the network). I’m happy to consider this a cause but… how would I know? Is that controllable or measurable in some way?

The bulbs are always switched on (the actual switches controlling the power to the bulbs are not easy to access, so they always receive power).

I can try removing some devices, but I’m sure you can understand that things like unplugging lightbulbs aren’t something I can do long-term without getting people very cross with me…

Most of the network is either Hue or Xiaomi devices. I could remove the door sensors (but given they are battery powered, would they be causing problems?).

Presumably the logs/errors I’m seeing in HA are part of the problem and should ‘stop’ (as this would be a more obvious sign of an improvement than waiting for a bulb to freak out or the network to be slow).

If it is a problematic device, would simply powering it off be enough, or would it have to be removed from the network?

NathanCu · May 6, 2025, 8:14pm

UniFi does not scan for Zigbee. It is only looking at Wi-Fi.

Khamrun · May 6, 2025, 9:01pm

Damn! I just wasn’t sure from your original post. Of course everyone’s setup is different, but I thought I might have some insight from the difficulties I’ve had with mine. PoE, huh? Jeez, I’ve never powered anything over Ethernet before. Then again, I haven’t needed to. Could that be the difficulty? Sorry, Slychocobo, that I couldn’t give you any better direction. I have the IKEA bulbs also and they paired REALLY well, so those do work great. Thanks for getting back to me. Please let us know how you finally resolve the difficulty. Have a great day, pal.

Slychocobo · May 6, 2025, 10:25pm

Well, I’ve discovered that HA can give an ‘Energy Scan’ of the Zigbee channels. I’m going to try to collate a few days’ worth of data and see if it provides any more useful information on how busy each channel is.

Hopefully it will help.

tom_l · May 7, 2025, 12:55am

Listen to Nathan.

Zigbee channel 20 overlaps with wifi ch6. Change to zigbee ch 25.

See: ZigBee and Wi-Fi Coexistence | MetaGeek
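As a rough cross-check of the channel advice, the centre frequencies can be worked out from the standard channel formulas (Zigbee channel k sits at 2405 + 5 × (k − 11) MHz, Wi-Fi channel n at 2407 + 5 × n MHz). The sketch below only compares centre frequencies. It is not an interference model, and the helper names are made up for illustration.

```python
def zigbee_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz Zigbee (IEEE 802.15.4) channel, 11-26."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4 GHz channels run from 11 to 26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz Wi-Fi channel, 1-13."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers Wi-Fi channels 1 to 13")
    return 2407 + 5 * channel


def separation_mhz(zigbee_channel: int, wifi_channel: int) -> int:
    """Distance between the two centre frequencies in MHz."""
    return abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))


for zb in (20, 25):
    for wifi in (1, 6):
        print(f"Zigbee {zb} vs Wi-Fi {wifi}: {separation_mhz(zb, wifi)} MHz apart")
```

A Wi-Fi channel is roughly 20 to 22 MHz wide, so Zigbee 20, at 13 MHz from the centre of Wi-Fi 6, sits right on its shoulder, while Zigbee 25 is 38 MHz away from Wi-Fi 6 and 63 MHz away from Wi-Fi 1.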

Also set your wifi APs’ 2.4 GHz radios to Medium transmit power. Not High or Auto.

Slychocobo · May 7, 2025, 3:19pm

I’m just collating a few days’ worth of data by grabbing the ‘diagnostic’ .json file from ZHA every 30 mins and poking it into a spreadsheet via a script, to see if that gives a better picture of the network.
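A minimal sketch of the averaging step, for anyone wanting to do the same. It assumes each diagnostics download is JSON with a per-channel mapping under data → energy_scan; that layout is an assumption, so check it against your own files. The function name and the two sample dumps are made up for illustration.

```python
import json
from collections import defaultdict
from statistics import mean


def average_energy(diagnostics_dumps):
    """Average the per-channel energy values across several diagnostics dumps.

    Assumes each dump is JSON text with a mapping at data -> energy_scan,
    keyed by channel number; adjust the keys if your files differ.
    """
    readings = defaultdict(list)
    for text in diagnostics_dumps:
        scan = json.loads(text).get("data", {}).get("energy_scan", {})
        for channel, energy in scan.items():
            readings[int(channel)].append(float(energy))
    return {ch: round(mean(vals), 2) for ch, vals in sorted(readings.items())}


# Two made-up dumps, only to show the shape of the input
dumps = [
    '{"data": {"energy_scan": {"11": 20.0, "24": 8.0, "25": 30.0}}}',
    '{"data": {"energy_scan": {"11": 22.0, "24": 10.0, "25": 15.0}}}',
]
print(average_energy(dumps))
```

In practice the list of dumps would come from reading the saved files off disk, one per 30-minute sample.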

Given that changing the channel is likely to mean several hours of work re-pairing all the devices, I want to make sure this is going to be an effective change.

Would I be correct in assuming that lower numbers in that are better (less traffic/noise)?

NathanCu · May 7, 2025, 3:34pm

If you’re talking about the output of the energy scan from ZHA, yes.

But honestly I’ve never found that tool to be much help. It snapshots a point in time, and that varies wildly.

Most devices should take the channel change just fine. Some won’t, but it should be a matter of making the change and then dealing with the exceptions.

dproffer · May 8, 2025, 3:29pm

I have found tinySA to be a good tool for looking at the 2.4 GHz spectrum for Zigbee, Bluetooth and WiFi interactions. It is not a USD 10K pro spectrum analyzer, but it is a very good tool for home users. Do support Erik Kaashoek by buying his real products.

https://www.tinysa.org/wiki/

Slychocobo · May 8, 2025, 8:36pm

The portable scanner you’ve linked looks quite neat… I’d be tempted. But for the moment I’ve let my script run every 30 mins for a few days, and out of 1200 data points it’s showing that, on average, channels 23/24 seem to have consistently the lowest energy values.

So I will try channel 24 and… well, see what happens, unless anyone can think of a good reason to use 25 (despite its higher values).

channel	average energy
11	20.19
12	28.9
13	26.21
14	24.57
15	49.28
16	58.96
17	77.68
18	75.19
19	53.67
20	51.75
21	18.8
22	13.13
23	10.47
24	9.22
25	22.45
26	48.97

NathanCu · May 8, 2025, 8:53pm

Use 25. While you should not have any issue, the general pattern is to stick to channels divisible by 5, which will be better tested. Above 25 may not be supported on older hardware.
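To put numbers on that trade-off, here is a small sketch using the averages from the table posted earlier in the thread. The `quietest` helper is made up for illustration, and restricting the candidates to 15, 20 and 25 is just one reading of the "divisible by 5" rule of thumb.

```python
# Average energy per channel, copied from the table earlier in the thread
AVERAGE_ENERGY = {
    11: 20.19, 12: 28.9, 13: 26.21, 14: 24.57, 15: 49.28, 16: 58.96,
    17: 77.68, 18: 75.19, 19: 53.67, 20: 51.75, 21: 18.8, 22: 13.13,
    23: 10.47, 24: 9.22, 25: 22.45, 26: 48.97,
}


def quietest(energy, candidates=None):
    """Return the channel with the lowest average energy among the candidates."""
    pool = candidates if candidates is not None else energy.keys()
    return min(pool, key=lambda ch: energy[ch])


print(quietest(AVERAGE_ENERGY))                           # 24
print(quietest(AVERAGE_ENERGY, candidates=(15, 20, 25)))  # 25
```

So 24 is the quietest overall on these measurements, but among the commonly used channels 25 is well ahead of 15 and 20.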

Slychocobo · May 9, 2025, 6:53pm

Okay, I’ve set it to channel 25. It’s been a few hours and I’m still seeing the same errors crop up in the logs and the same occasional delays, but I will leave it till tomorrow and see if there’s an improvement.

Slychocobo · May 10, 2025, 9:33am

Well… That did not go well.

I changed the channel to 25 yesterday at around 4pm. It did not seem to make any noticeable change during the day, but I decided to leave it to sort itself out. I woke up this morning with not a single Zigbee device being controllable.
The ZHA integration was marked as failed in HA, stuck in a loop complaining that it could not communicate with the coordinator. It looks like the issue started at around 1am, with the logs showing that it started to complain about the ‘ZHA backup not matching existing settings’, followed a few minutes later by watchdog failures and connection timeouts.

I tried rebooting the coordinator and HA, but to no avail.

Trying to bring up the configuration page for ZHA would only present two options (paraphrasing): ‘Reconfigure the existing radio’ or ‘Create a new Zigbee network’. Selecting either option would fail with more timeout errors.

So at the moment my Zigbee network is flat dead and HA is refusing to communicate with the coordinator, although the coordinator’s web UI and onboard logs show nothing amiss other than that it’s not connected to ZHA.
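Since the coordinator is attached over Ethernet, one basic check is whether its Zigbee socket accepts a TCP connection at all from the HA host. The sketch below is only that: a port reachability test, which says nothing about whether the Zigbee firmware behind the port is responding. The function name is made up, and the address and port in the usage note are assumptions to be replaced with the values from your own ZHA configuration.

```python
import socket


def coordinator_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # Open and immediately close the connection; no data is sent.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, `coordinator_reachable("192.168.1.50", 6638)`, where the IP is hypothetical and 6638 is assumed to be the coordinator's socket port. It is probably best run while ZHA is stopped or failing, as an extra connection may disturb an active session.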

Am I going to have to simply reset and rebuild?

2025-05-09 01:29:27.197 WARNING (MainThread) [zigpy.application] Watchdog failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1079, in request
    response = await response_future
               ^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy/application.py", line 661, in _watchdog_loop
    await self.watchdog_feed()
  File "/usr/local/lib/python3.13/site-packages/zigpy/application.py", line 647, in watchdog_feed
    await self._watchdog_feed()
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/zigbee/application.py", line 607, in _watchdog_feed
    await self._znp.request(c.SYS.Ping.Req())
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1075, in request
    async with asyncio_timeout(
               ~~~~~~~~~~~~~~~^
        timeout or self._znp_config[conf.CONF_SREQ_TIMEOUT]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ):
    ^
  File "/usr/local/lib/python3.13/asyncio/timeouts.py", line 116, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

2025-05-09 01:33:38.017 ERROR (MainThread) [aiohttp.server] Error handling request from 192.168.30.218
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/aiohttp/web_protocol.py", line 480, in _handle_request
    resp = await request_handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/aiohttp/web_app.py", line 569, in _handle
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/aiohttp/web_middlewares.py", line 117, in impl
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/http/security_filter.py", line 92, in security_filter_middleware
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/http/forwarded.py", line 210, in forwarded_middleware
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/http/request_context.py", line 26, in request_context_middleware
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/http/ban.py", line 86, in ban_middleware
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/http/auth.py", line 242, in auth_middleware
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/http/headers.py", line 32, in headers_middleware
    response = await handler(request)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/http.py", line 73, in handle
    result = await handler(request, **request.match_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/diagnostics/__init__.py", line 282, in get
    data = await info.config_entry_diagnostics(hass, config_entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/diagnostics.py", line 82, in async_get_config_entry_diagnostics
    gateway: Gateway = get_zha_gateway(hass)
                       ~~~~~~~~~~~~~~~^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/helpers.py", line 1055, in get_zha_gateway
    raise ValueError("No gateway object exists")
ValueError: No gateway object exists