Help desired – ZHA groups/multicast unreliability, `APS_DEFRAG_DEFERRED`

Hey all,

For a few months now, my EcoSmart BR30 light bulbs, which are in ZHA groups per room, have been unreliable: they occasionally seem to lock up and stop responding to commands to turn on/off or to change color/brightness. If I wait a few seconds and try again, they usually go back to normal. It seems to be worst when I send multiple commands in quick succession, like turning off right after turning on. The issue doesn’t seem to reproduce when I control an individual bulb within the group, even when toggling really quickly. These bulb groups worked very reliably for a few months after I moved into this house in June and got the new bulbs set up, but the problem first started some time around September or October. I don’t know exactly when; unfortunately, today is my first free day to sit down and reach out for help. It has persisted through several HA updates and, of course, restarts.

I am running HA 2021.1.5 in a Docker container on a Linux desktop, using ZHA for Zigbee, with the HUSBZB-1 USB stick for Zigbee and Z-Wave.

My first suspicion was that this was a reception problem, as the house is large and old and we had WiFi reception issues due to some really thick and heavy walls, and the server with USB stick is in the basement. However, I tried a few things to address this with no change:
(1) I attached my Zigbee USB stick to a USB extender and taped the stick high on the wall, away from my computer.
(2) I found that the Zigbee channel that the stick was using overlapped in frequency with WiFi 2.4GHz channel 1, and at the time my WiFi was using channel 1 for 2.4GHz (on auto mode), so I configured my WiFi auto channel selector to exclude channel 1 and only choose from channels 6 or 11. Analyzing the nearby WiFi signals shows that none of my neighbors have strong channel 1 signals nearby that should be affecting anything.
Some other evidence further suggests that reception is not the issue:
(1) The overhead bulbs in my basement, right next to my Zigbee USB stick with no walls or any other form of interference between them, exhibit the same issue just as often as anywhere else in the house.
(2) When the issue happens, it affects all bulbs in the group equally. I’m not familiar with the deep implementation details of Zigbee multicast, but if there were reception issues I’d expect that sometimes some members of the group would receive the message and others wouldn’t, even if they’re in the same room.
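
For anyone wanting to repeat the channel check from step (2) of the first list, here is a minimal sketch. It assumes the standard 2.4 GHz channel plans (Zigbee channels 11-26 spaced 5 MHz apart and roughly 2 MHz wide, WiFi channels 1-13 spaced 5 MHz apart and roughly 22 MHz wide). The function names and the default widths are my own choices for illustration, and real-world interference also depends on signal strength, which this ignores.

```python
def zigbee_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz Zigbee (IEEE 802.15.4) channel, 11-26."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4 GHz channels are 11-26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz WiFi channel, 1-13."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers WiFi channels 1-13")
    return 2407 + 5 * channel


def overlaps(zigbee_channel: int, wifi_channel: int,
             wifi_width_mhz: float = 22.0, zigbee_width_mhz: float = 2.0) -> bool:
    """True if the two channels' nominal frequency ranges intersect."""
    distance = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return distance < (wifi_width_mhz + zigbee_width_mhz) / 2


def clear_zigbee_channels(wifi_channels):
    """Zigbee channels that overlap none of the given WiFi channels."""
    return [z for z in range(11, 27)
            if not any(overlaps(z, w) for w in wifi_channels)]


if __name__ == "__main__":
    # WiFi restricted to channels 6 and 11, as described above.
    print(clear_zigbee_channels([6, 11]))
```

Under these assumptions, a Zigbee network on any of channels 11-14 would be in the range of WiFi channel 1, which is the kind of overlap described in the post.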

My current suspicion is some sort of software issue, either in the bulbs themselves (but why would it start after a few months?) or in ZHA (I wish I had paid more attention to whether this started after an HA upgrade). I have turned on ZHA debug logging, reproduced the issue, and captured this: http://sprunge.us/K1k0ts
If you search for APS_DEFRAG_DEFERRED, I believe that’s the point at which the issue occurred. In previous attempts to debug this using the logs, I’ve seen the same status code come up when the issue happens.

Around this same time, I noticed that my hand-rolled adaptive/circadian lighting setup started turning these bulbs on at an incorrect color temperature, which would only be corrected later. That leads me to believe that the “turn on” command for these bulbs ends up being two commands under the hood, one that turns the light on and another that sets the temp/brightness, and that often the first command would succeed and the second would fail.

What do you think? Thank you in advance for your help!

An update: I tried updating the firmware on my USB stick using https://github.com/walthowd/husbzb-firmware, the version is now 6.6.5, and the issue is still there with no change.

I was pointed to this in another thread. I note it looks like you are keeping up with HA upgrades, and it seems that quite a lot of changes are occurring under the covers in ZHA and Bellows. I’m not sure which coordinator code your HUSBZB-1 USB stick uses, but I wonder if you might have been bitten by one of these upgrades? There seems to be a lot going on in order to support both ZLL and now Zigbee 3 lighting. Have a look at the options listed in the second link below; some sound like they are in the area of the message you saw in the log. I also see posts from others with experience saying that some lights that ‘route’ can have problems, and there is no way that I know of to turn this off with a firmware upgrade to the lights. What does your routing map look like? Maybe try removing the bulbs and re-adding them ‘VIA’ a router other than one of the bulbs?

‘# Tons and tons of options’

CONFIG_MULTICAST_TABLE_SIZE
CONFIG_FRAGMENT_WINDOW_SIZE
CONFIG_FRAGMENT_DELAY_MS

@dproffer - I definitely suspect one of the upgrades that I did a few months back. I tried at one point to downgrade to a version I might have had installed at that time to see if that would resolve it and give me a lead of how to fix it going forward, but I didn’t know exactly what version I was running (if I could find out which Docker image versions I had deployed in the past that would help), and when I picked a random version I got startup crashes with my modern configuration and databases and it didn’t seem productive to debug all those and try to get that old version running.

I just turned on source routing with a table size of 100 (the ZHA integration says I currently have 88 devices, though some of them are offline as I haven’t needed them since moving out of my old apartment). The issue is still happening, but one of the docs mentioned that it takes a while for source routing to rearrange things so I’ll try it again tomorrow morning. Thank you for the tips!

One thing I have considered as a desperate last measure is to buy another Zigbee USB stick of a different make/model, set it up with something like Zigbee2MQTT in another container, and add some of these bulbs there. If it works then worst case I could run my bulbs and other devices through two separate networks, but ideally it would let me narrow down which part of the stack the problem might be in. This assumes that ZHA and Zigbee2MQTT don’t use the same libraries under the hood, like Bellows/Zigpy or anything like that.

I will be interested to see if you see any difference with the higher layer source routing.

As you suspect, I think there are so many moving parts in HA with each release. It is very hard to trace what changes are about to occur and how they might affect your current config. Not to mention whether you can revert…

Zigbee is a real wild west, not that I am any sort of expert on the topic.

Unfortunately, I think you are on the right track in having multiple Zigbee networks in your quiver, and accepting that you might need to move devices between them. I found zigbee2mqtt a nice and solid platform. I moved to ZHA as part of a server device consolidation I was doing, but in hindsight, moving to a new zigbee2mqtt Docker image might have been a better move. I’m not dissing the good work the ZHA folks are doing, but now, after moving, I continue to prefer having an MQTT ‘wall’ between my devices and device controllers and HA.

Not that I am recommending this, but my experience has been that the best third-party Zigbee network provider has been Samsung SmartThings. Zigbee2MQTT, ZHA and others are catching up, but SmartThings had both some great Zigbee gurus and Samsung’s Zigbee presence for the longest time. If you wanted an OTA firmware upgrade for a device other than Philips, it was the only place to go, IMHO.

To that I ask you, do you know if there are OTA firmware upgrades for your lights? That is another whole dimension to the path to a good solution.

So, yes, I think trying your bulbs on zigbee2mqtt or deCONZ (the first is less expensive, I think) is a path you, unfortunately, should probably be ready for. I am not recommending buying a SmartThings 2 hub (I think that is the latest), but if you come across one for zero cost or a beer, grab it for your toolbox (which is where mine is), if for no other use than as a possible OTA firmware upgrader.

Unfortunately, source routing did not seem to do the trick, even after giving it a couple days to adjust and rebalance.

Someone in the Discord advised me that APS_DEFRAG_DEFERRED means that the Zigbee network is flooded, and that routers repeating and re-broadcasting can contribute to that. I just tried changing some constants in the Bellows library – I set EZSP_DEFAULT_RADIUS=2 and EZSP_MULTICAST_NON_MEMBER_RADIUS=1, and restarted my container. I’ll see how that performs over the next couple days.

Well, bummer. I was trying to see what effect the source routing option has in the debug logs. I think I see where it logs that it starts, but I have not yet seen any log entry that says something is happening… It is additionally frustrating that you really don’t have good instrumentation and are basically flying blind with each change.

@mmallozzi did changing those variables end up working? I have been searching for a solution to the Status.APS_DEFRAG_DEFERRED error on my network of about 30 EcoSmart bulbs, both A19 and BR30. It isn’t just them, though: my IKEA bulbs are also now being turned on by Adaptive Lighting after being turned off by a remote. After going through the same troubleshooting steps as you, checking WiFi channels and all that, I hope we can solve the network flooding issue! The next thing I’m going to try is reducing my Adaptive Lighting update frequency to every 15-30 minutes, staggered, instead of every 2-5 minutes, to see if that reduces the network traffic enough.

Sorry I forgot to update – not much luck on that one. I have observed a bit of a change, which seems to have traded off a bit more reliability in the APS_DEFRAG_DEFERRED department (no numbers, might be wishful thinking) for an occasional issue where one bulb in a group doesn’t get the message. That latter issue has mostly stabilized and gone away though, it happened mostly towards the beginning – but not right away, as I think it took a day or two for the changes to really kick in.

My next step will be to order another Zigbee stick that is well supported by zigbee2mqtt, set up a new network there with some of the rooms that I haven’t installed these bulbs in yet, and see how the reliability feels. Or even set up that other network with another instance of HA/ZHA, step through the major version updates of HA one week at a time, and try to see if a particular version started breaking things. Maybe I can automate it to turn lights on/off with a timer and measure APS_DEFRAG_DEFERRED instances in the logs, so that I can have a more objective comparison and not rely on manually turning the lights on and off in rooms that I don’t use much yet.
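
A rough sketch of the log-measuring idea above, for anyone who wants a starting point. It counts lines containing APS_DEFRAG_DEFERRED per hour. The timestamp pattern is an assumption (lines starting with "YYYY-MM-DD HH:MM:SS"), so adjust the regex if your log lines look different; the function names are made up for this example.

```python
import re
from collections import Counter

# Assumed line prefix: "YYYY-MM-DD HH:MM:SS ...". Change this if your logs differ.
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}")
MARKER = "APS_DEFRAG_DEFERRED"


def count_by_hour(lines, marker=MARKER):
    """Count lines containing `marker`, bucketed by the hour in their timestamp."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue
        match = TIMESTAMP_RE.match(line)
        # Lines without a recognizable timestamp go into an "unknown" bucket.
        hour = match.group(1) + ":00" if match else "unknown"
        counts[hour] += 1
    return counts


def count_file(path):
    """Same as count_by_hour, reading the lines from a log file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_by_hour(handle)
```

Calling `count_file` on the Home Assistant log (the path depends on your setup) before and after a change would give a more objective comparison than toggling lights by hand, and the per-hour counts are easy to feed into a graph.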

Hey, so I think I fixed it on my end. I had my Adaptive Lighting intervals set to update every 90-100 seconds and changed them all to around 8000-1200 seconds, staggered so they won’t all update at once. Everything is working well now, almost perfectly reliable unless I purposefully overload the network by pressing buttons quickly. If you do come up with an APS_DEFRAG_DEFERRED tracker, please ping me because I’d love to see a graph of when it occurs :slight_smile:

That’s great news! I use my own custom implementation of adaptive lighting, so I’ll see if I can try the same thing there – right now it’s tied to sensor.date_time_iso so it’s probably once a minute all at the same time.
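
For a custom setup that fires once a minute for every room at the same instant, one possible way to spread the load is to give each room its own offset within the minute. This is only a sketch of that idea: the room names, the 60-second period, and the function name are all hypothetical, and how you would hook the offsets into your own automation depends on how it is written.

```python
def phase_offsets(rooms, period_s=60):
    """Spread rooms evenly across one period.

    Returns a dict mapping each room to the offset (in seconds from the
    start of the period) at which that room should be updated.
    """
    ordered = sorted(rooms)  # stable order so offsets don't shuffle between runs
    if not ordered:
        return {}
    slot = period_s / len(ordered)
    return {room: round(i * slot, 3) for i, room in enumerate(ordered)}


if __name__ == "__main__":
    # Hypothetical room names, purely for illustration.
    print(phase_offsets(["kitchen", "attic", "den", "bath"]))
```

With four rooms and a 60-second period, the updates land 15 seconds apart instead of all at once.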

I’m having the same issue with a bajillion of the EcoSmart bulbs. I was happy to see you found a solution to the APS_DEFRAG_DEFERRED problem. However, I have a couple of questions about your solution:

Do you have the adaptive lighting interval set to 8000 or 800 seconds? Seems like 800 is more likely.

How do you stagger the updating of the adaptive lighting? I don’t see that option in the documentation.
Do you just have all of the individual switches set to update at different intervals? Does adaptive lighting do the update from when the light was turned on or at a specified time?

Thanks.

Hey, can you please explain how you changed the update interval and, more importantly, how you staggered the updates so they don’t flood the network? I just read this thread, getting more and more excited, only to find it petered out just before the explanation that I was looking for! :wink:

The problem with the Ecosmart lights is that their firmware crashes if it receives a command while executing a previous command. Redditor u/Wwalltt seems to have the best handle on what’s going on, see this thread: https://www.reddit.com/r/homeassistant/comments/kfxs8z/ecosmart_homedepot_zigbee_bulbs_for_797_per_2pack/ggbuxt9/?context=1.

Therefore, the goal is to prevent or reduce the odds that a light (or any light in your network) will be executing an update when you send it another command. This wouldn’t be a problem except Adaptive Lighting automatically updates the lights on a regular frequency and I got smart bulbs specifically to change colors and brightness throughout the day.
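
If you send commands from your own scripts, one way to approach that goal is to enforce a minimum gap between commands to the same light or group. The sketch below is a generic throttle, not anything ZHA or Adaptive Lighting provides; the class name, the `send` callback, and the gap value are all assumptions, and the thread gives no figure for how long a bulb stays busy.

```python
import time


class CommandThrottle:
    """Refuse to send a command to a target until `min_gap_s` has passed
    since the last command sent to that same target."""

    def __init__(self, min_gap_s, clock=time.monotonic):
        self.min_gap_s = min_gap_s
        self._clock = clock          # injectable so the class can be tested
        self._last_sent = {}         # target -> time of last command

    def wait_time(self, target):
        """Seconds still to wait before `target` may receive another command."""
        last = self._last_sent.get(target)
        if last is None:
            return 0.0
        return max(0.0, self.min_gap_s - (self._clock() - last))

    def try_send(self, target, send):
        """Call send(target) if the gap has elapsed; return whether it was sent."""
        if self.wait_time(target) > 0:
            return False
        send(target)
        self._last_sent[target] = self._clock()
        return True
```

A caller could either drop a refused command or sleep for `wait_time(target)` and retry; which is better depends on whether the newest command should replace the pending one.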

My suggestion is to increase the update “interval” to a larger number. Reducing the update frequency reduces the chance that you or an automation will try to update a light while it’s being updated by Adaptive Lighting. If you have multiple Adaptive Lighting configurations (see my note below), you can stagger all of your intervals to minimize the chance of multiple lights/light groups updating at once. Technically, staggering updates increases the odds that any one group is updating while you try to, say, manually turn everything off. But for me it’s more likely for a conflicting Adaptive Lighting update to be sent than a conflicting “All On/Off” command.
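
To put rough numbers on the reasoning above, here is a small sketch. It assumes each update keeps a bulb busy for some fixed time (the 2 seconds used in the example is a guess, not a measured value) and that timers for different configurations start at the same moment, which may not be how Adaptive Lighting actually schedules them. The group names and step size are hypothetical.

```python
from math import gcd


def staggered_intervals(base_s, step_s, groups):
    """Give each group its own interval: base, base + step, base + 2*step, ..."""
    return {group: base_s + i * step_s for i, group in enumerate(groups)}


def coincidence_period_s(a_s, b_s):
    """Two timers started together with periods a and b (whole seconds)
    fire at the same instant every lcm(a, b) seconds."""
    return a_s * b_s // gcd(a_s, b_s)


def busy_fraction(update_duration_s, interval_s):
    """Fraction of the time a light is busy handling periodic updates."""
    return update_duration_s / interval_s


if __name__ == "__main__":
    print(staggered_intervals(800, 50, ["living", "kitchen", "bedroom"]))
    print(coincidence_period_s(90, 100))   # short intervals coincide often
    print(coincidence_period_s(800, 900))  # longer intervals coincide rarely
    print(busy_fraction(2, 100), busy_fraction(2, 800))
```

Under these assumptions, going from a 100-second to an 800-second interval cuts the time a light spends busy by a factor of eight, so a manual command is that much less likely to land on top of an update.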

@Delorean14 How do you change the interval? When you add the Adaptive Lighting integration, the integration gets a card in the Integrations section. Hit Configure and a window pops up that lets you configure it.

Note on groups: It helps to group the lights in Zigbee so that each update is sent once to multiple lights, instead of as one command per light. You still need a separate Adaptive Lighting configuration for each group of bulbs that you want to control independently. If you only set up one Adaptive Lighting configuration for all lights in your house, any lights that are not on will be turned on when the update gets sent. I prefer to use motion sensors to turn off lights in unoccupied rooms and save power, but that means an Adaptive Lighting config for each room in the house.