I’ve been running a Home Assistant setup for a few years now, but over the last 6 months our Zigbee-based devices have started behaving so erratically that we no longer trust them, and we are at a loss as to how to identify the root cause or how to proceed.
The network was in place for over a year and was functioning well; then, in the last 5-6 months and for no obvious reason, it started to misbehave to the point where we are getting desperate.
It started with the occasional delay before an action took place. For example, a light.turn_on call could take 3 to 15 seconds to happen. Often the whole network will stop responding to instructions, then execute all the queued-up requests in rapid succession (such as turning lights on, off, and back on).
The behaviour is intermittent: sometimes the system will work for days or weeks on end without fault, then start misbehaving again, apparently at random.
This now seems to be escalating to new, stranger behaviours, such as lights dimming in a loop or flashing/blinking apparently at random (e.g. one bulb in a group of five will just start ramping up from 0 brightness repeatedly). Often it is a different bulb in a different room. We do not have any automations or behaviours assigned to single bulbs; everything is attached to a Zigbee group of bulbs in ZHA.
In the last week, we’ve had a small number of devices (an Ikea smart plug and a Philips Hue bulb) fall off the network; they had to be reset and paired again, which is new and troubling behaviour.
What I’ve tried
- Checked for interference / tried other channels: I used the AP scanning function on my UniFi APs (2 APs in the property, running on channels 1 and 6) to get channel usage. It shows low utilization/noise on most channels, including the range around Zigbee channel 20 (2450 MHz, which overlaps with Wi-Fi channel 8). This is the only real tool I have to check this.
- Moving the coordinator to another part of the house, making sure it is away from any other electronic devices and the access points
- Adding an additional router to boost the signal on the other side of the house
- Rebuilding the Zigbee network fully, with a factory reset and re-pairing of each device in place
- Replacing the coordinator, going from an Electrolama to an SLZB-06
- Replacing Zigbee2MQTT with ZHA
- Rebuilding Home Assistant fully
- Moving HA to a more powerful host device

We thought for a little while that rebuilding HA had helped with the issues, but it is unclear whether this is the case.
The new host is usually sitting at around 12% CPU usage and about 10% RAM, so it does not seem to be taxed in any way. Looking at the Zigbee network map in ZHA (attached), we can see that most of the connecting lines are ‘yellow’ and occasionally green.
(The red device is one of the Ikea Smart-plugs we have not gotten around to re-pairing back to the network)
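For reference, the channel overlap mentioned in the interference check can be worked out with a little arithmetic. This is a minimal sketch, assuming the standard 2.4 GHz channel plans (Zigbee channels 11 to 26 at 5 MHz spacing starting from 2405 MHz, Wi-Fi channels 1 to 13 centred at 2407 + 5n MHz) and idealised rectangular channels of 2 MHz for Zigbee and 20 MHz for Wi-Fi. The function names are made up for illustration, and real radios leak energy beyond these nominal widths, so a "clear" result here is only a rough guide.

```python
def zigbee_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz Zigbee (IEEE 802.15.4) channel, 11 to 26."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4 GHz channels run from 11 to 26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz Wi-Fi channel, 1 to 13."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers Wi-Fi channels 1 to 13")
    return 2407 + 5 * channel


def overlaps(zigbee_channel: int, wifi_channel: int,
             wifi_width_mhz: float = 20.0, zigbee_width_mhz: float = 2.0) -> bool:
    """True if the two idealised channel bands share any spectrum."""
    gap = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return gap < (wifi_width_mhz + zigbee_width_mhz) / 2


# Zigbee channels that avoid both access points (Wi-Fi channels 1 and 6)
# under the idealised 20 MHz assumption.
clear_of_aps = [z for z in range(11, 27)
                if not overlaps(z, 1) and not overlaps(z, 6)]

print("Zigbee 20 centre:", zigbee_center_mhz(20), "MHz")
print("Zigbee 20 vs Wi-Fi 8:", overlaps(20, 8))
print("Zigbee 20 vs Wi-Fi 6:", overlaps(20, 6))
print("Clear of Wi-Fi 1 and 6:", clear_of_aps)
```

Under these assumptions Zigbee channel 20 sits just above the nominal upper edge of Wi-Fi channel 6, so interference from neighbouring networks on other Wi-Fi channels is not ruled out by this check.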
The network itself is composed of 72 devices, with the following breakdown:
33 Philips Hue bulbs (Mains Powered)
6 Ikea ‘Tradfri’ bulbs (Mains powered)
6 Philips Hue Remotes (Battery)
2 Ikea Tradfri Remotes (Battery)
5 Xiaomi mmWave motion detectors (Mains powered USB)
5 Xiaomi Aqara Door sensors (Battery)
And this is spread over 3 floors in a normal-sized UK home (i.e., fairly small)
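As a quick tally of the breakdown above, split by power source (mains-powered Zigbee devices typically act as routers, battery devices as end devices): this is only a sketch, and the `DEVICES` table is simply the list above typed in by hand. The itemised lines add up to 57, so the remainder of the 72 devices is presumably not itemised here.

```python
from collections import Counter

# (description, count, power source), copied from the breakdown above.
DEVICES = [
    ("Philips Hue bulbs", 33, "mains"),
    ("Ikea Tradfri bulbs", 6, "mains"),
    ("Philips Hue remotes", 6, "battery"),
    ("Ikea Tradfri remotes", 2, "battery"),
    ("Xiaomi mmWave motion detectors", 5, "mains"),
    ("Xiaomi Aqara door sensors", 5, "battery"),
]

by_power = Counter()
for _name, count, power in DEVICES:
    by_power[power] += count

listed_total = sum(by_power.values())
not_itemised = 72 - listed_total  # 72 is the stated network size

print(dict(by_power), "listed:", listed_total, "not itemised:", not_itemised)
```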
I’ve been trying to collect debug logs from HA and the SLZB-06 Zigbee coordinator, even just from today, but honestly I’m not sure exactly what is relevant. Here are some snippets from HA and the coordinator that keep cropping up.
HA logs, this error shows up several times a day:
Logger: homeassistant
Source: /usr/src/homeassistant/homeassistant/runner.py:112
First occurred: 03:18:26 (10 occurrences)
Last logged: 13:24:11
Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1120, in request_callback_rsp
    return await callback_rsp
           ^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1117, in request_callback_rsp
    async with asyncio_timeout(timeout):
               ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/usr/local/lib/python3.13/asyncio/timeouts.py", line 116, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
And often this error as well:
2025-05-06 11:01:40.841 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1118, in request_callback_rsp
    await self.request(request, timeout=timeout, **response_params)
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1079, in request
    response = await response_future
               ^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1117, in request_callback_rsp
    async with asyncio_timeout(timeout):
               ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/usr/local/lib/python3.13/asyncio/timeouts.py", line 116, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
2025-05-06 11:01:40.843 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1120, in request_callback_rsp
    return await callback_rsp
           ^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/zigpy_znp/api.py", line 1117, in request_callback_rsp
    async with asyncio_timeout(timeout):
               ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/usr/local/lib/python3.13/asyncio/timeouts.py", line 116, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
We seem to get a lot of these errors in the log on the SLZB-06, but the timestamps don’t seem to line up with the errors in HA, so we don’t know whether they are related:
[10:31:04] zb_packet | wrong paket len: 2 expected: 4
[10:41:43] zb_packet | wrong paket len: 2 expected: 4
[10:52:29] zb_packet | wrong paket len: 57 expected: 202
[11:21:13] zbSelfOta | Heap: 2684
[11:29:47] zb_packet | wrong paket len: 5 expected: 70
[11:39:45] zb_packet | wrong paket len: 5 expected: 49
[12:21:13] zbSelfOta | Heap: 2684
[12:23:59] zb_packet | wrong paket len: 2 expected: 4
[12:40:58] zb_packet | wrong paket len: 2 expected: 4
[13:05:23] zb_packet | wrong paket len: 2 expected: 4
[13:05:23] zb_packet | wrong paket len: 13 expected: 14
[13:21:13] zbSelfOta | Heap: 2684
[13:24:45] zb_packet | wrong paket len: 5 expected: 164
[13:39:03] zb_packet | wrong paket len: 2 expected: 4
[13:40:02] zb_packet | wrong paket len: 5 expected: 176
[13:48:00] zb_packet | wrong paket len: 43 expected: 198
[14:21:13] zbSelfOta | Heap: 2684
[14:56:40] zb_packet | wrong paket len: 5 expected: 43
[15:21:13] zbSelfOta | Heap: 2684
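One way to check whether the coordinator messages line up with the HA errors is to compute, for each HA error time, the gap to the nearest "wrong paket len" entry. This is a rough sketch, assuming both logs use the same clock and time zone (which may not be true) and that all entries fall on the same day. The regular expression is written against the lines quoted above, and the function names are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as: [13:24:45] zb_packet | wrong paket len: 5 expected: 164
COORD_LINE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\] zb_packet \| wrong paket len: (\d+) expected: (\d+)$"
)


def parse_coordinator_log(text: str) -> list[tuple[datetime, int, int]]:
    """Return (time, actual_len, expected_len) for each length-mismatch line."""
    events = []
    for line in text.splitlines():
        match = COORD_LINE.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%H:%M:%S")
            events.append((when, int(match.group(2)), int(match.group(3))))
    return events


def nearest_gap_seconds(ha_time: str, events: list[tuple[datetime, int, int]]) -> float:
    """Seconds between an HA error time (HH:MM:SS) and the closest coordinator event."""
    target = datetime.strptime(ha_time, "%H:%M:%S")
    return min(abs((target - when).total_seconds()) for when, _, _ in events)


# A few lines copied from the coordinator log above.
sample = """\
[10:52:29] zb_packet | wrong paket len: 57 expected: 202
[11:21:13] zbSelfOta | Heap: 2684
[11:29:47] zb_packet | wrong paket len: 5 expected: 70
[13:05:23] zb_packet | wrong paket len: 2 expected: 4
[13:24:45] zb_packet | wrong paket len: 5 expected: 164
"""

events = parse_coordinator_log(sample)
for ha_time in ("11:01:40", "13:24:11"):  # HA error times quoted above
    print(ha_time, "-> nearest coordinator event is",
          nearest_gap_seconds(ha_time, events), "seconds away")
```

A consistently small gap would hint that the two are related, while large and scattered gaps would suggest they are independent, though a clock offset between the two devices could hide a real correlation.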
I’ll also attach a copy of the debug logs from ZHA, captured while the behaviour was happening, but the logs are extensive! I’ve had to upload them to Google Drive as they are too large to share on Pastebin etc.