ZHA error since last update

Ever since I updated my Home Assistant installation to the latest 2020.12.1, I’ve been running into trouble with my ZHA integration. About once a day my entire network goes down without warning, and I can’t find any clear pattern in what leads up to the crash. Sensors and buttons all go unresponsive and return to normal right after restarting the server from inside the GUI.

In the meantime I’ve enabled debug logging, and this shows up every time around the moment everything goes dark:

2020-12-24 11:36:14 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback SerialTransport._read_ready()
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.8/site-packages/serial_asyncio/__init__.py", line 119, in _read_ready
    self._protocol.data_received(data)
  File "/usr/local/lib/python3.8/site-packages/zigpy_deconz/uart.py", line 85, in data_received
    self._api.data_received(frame)
  File "/usr/local/lib/python3.8/site-packages/zigpy_deconz/api.py", line 359, in data_received
    fut.set_result(data)
asyncio.exceptions.InvalidStateError: invalid state
2020-12-24 11:36:14 WARNING (MainThread) [zigpy_deconz.api] No response to 'Command.aps_data_indication' command with seq id '0xa3'
2020-12-24 11:36:14 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/zigpy_deconz/api.py", line 305, in _command
    return await asyncio.wait_for(fut, timeout=COMMAND_TIMEOUT)
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 498, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/zigpy_deconz/api.py", line 462, in _aps_data_indication
    r = await self._command(
  File "/usr/local/lib/python3.8/site-packages/zigpy_deconz/api.py", line 310, in _command
    self._awaiting.pop(seq)
KeyError: 163
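
In case anyone wants to capture the same output: a logger block along these lines in configuration.yaml should surface these messages (just a sketch, not necessarily my exact levels; adjust to what you need):

logger:
  default: warning
  logs:
    homeassistant.components.zha: debug
    zigpy: debug
    zigpy_deconz: debug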

Background info:
I’m running HA on a Raspberry Pi 4 with 4 GB of RAM, inside a Docker container. ZHA is handled by a ConBee II stick connected via a USB extension cable. Everything worked great until the update, but now it barely manages a full day without intervention.

What I’ve tried myself
I’ve unplugged and replugged the ConBee stick, updated the stick’s firmware through the Phoscon app on my laptop, and switched between the stable and latest HA images hoping a fix had landed in the meantime… all to no avail.
Edit: I also completely removed the integration and added it again; no cigar…

I’m a bit stuck on what to try next, so any tips or pointers would be greatly appreciated!

Should someone stumble in here through Googling, I’ll pipe up and say I goofed up. Apparently the UPnP integration had connected my router to HA and was tracking all of its traffic, ballooning my home_assistant_v2.db to a size the Pi couldn’t really handle. That caused massive slowdowns and made processes like this one time out. I cleared out the db file, limited the recorder retention and got rid of the router integration… and now my Zigbee network is working better than ever, blazing fast.
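
If anyone wants to do the same, “limiting the recorder retention” is just a recorder block in configuration.yaml; a minimal sketch (the number of days here is only an example, not necessarily what I used):

recorder:
  # Keep only a few days of history instead of the default 10
  purge_keep_days: 3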

Problem solved!

Hi there,

Just wondering whether this issue has stayed resolved for you and was really caused by your DB.
I am having exactly the same error and symptoms; however, my DB isn’t ballooning or excessively big (>200 MB MariaDB running on an HA Blue with the noisiest sensors excluded). My system also runs fine as long as I stay on 2020.12.2, but these symptoms appear as soon as I move to 2021.x.x (and updating is the current urgent advice given the ongoing security issue). I will try disabling MariaDB and restarting with a clean built-in DB, and will post my results for anyone else who struggles with this, but if your issues came back, I might have to look elsewhere for the source of the problem.

thanks in advance

To be honest, I’m not 100% sure the size of the db itself was the problem; I think it was more the flood of messages it could no longer keep up with. Disabling the router integration made a massive improvement. I’ve had the problem once more, when we were watching a YouTube livestream and pychromecast felt it necessary to report that multiple times per second. ZHA just seems to crash as soon as it times out because of other processes, or something like that. But besides that one occurrence, I haven’t had the problem since.

Thanks for the response!

Yeah, I tried switching from MariaDB to a fresh native DB and updated to 2021.1.3, and right away during startup I got this again:

Logger: homeassistant
Source: /usr/src/homeassistant/homeassistant/runner.py:99
First occurred: 8:54:58 AM (1 occurrences)
Last logged: 8:54:58 AM

Error doing job: Exception in callback SerialTransport._read_ready()
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.8/site-packages/serial_asyncio/__init__.py", line 119, in _read_ready
    self._protocol.data_received(data)
  File "/usr/local/lib/python3.8/site-packages/zigpy_deconz/uart.py", line 91, in data_received
    self._api.data_received(frame)
  File "/usr/local/lib/python3.8/site-packages/zigpy_deconz/api.py", line 359, in data_received
    fut.set_result(data)
asyncio.exceptions.InvalidStateError: invalid state

There really appears to be something in 2021.x.x that immediately doesn’t agree with my setup and crashes ZHA. I suppose I’ll have to start disabling integrations one by one to see if there is a particularly noisy one. I wish I knew of a more efficient way to debug/monitor this, as this is likely going to be a time-consuming and frustrating effort.

Thanks again for now.

FYI: I appear to be having the same problem in the same time frame. Unfortunately, I have no solution. I’m running a wired Pi 3B (CPU typically 6%, 514 MB RAM in use) and the same Zigbee stick with ZHA. I have to reload ZHA sometimes twice a day. I suspect I might have a range problem on some of my devices (Aqara, Sonoff, Hue: buttons, lights, switches, sensors). I also have a Hue hub.

Hi John,

Thanks for chiming in. Sorry to hear you are dealing with this too, but glad to hear I’m not alone with this issue. For clarity, I am running a wired HA Blue (ODROID), which also averages its CPU usage at 4% and memory at ~50%.

Going through endless forums, I also suspected range problems, and have re-added some devices that showed red lines in the Zigbee mesh visualisation (though no router devices seem to have a really unhealthy connection, as far as I can tell).

What makes me doubt that, though, is the fact that it runs stable under 2020.12.x. I ran it for over 24 hours (with my custom components disabled) as a test, and the error never appeared and the Zigbee network remained fully operational. It really is only as soon as I upgrade to any 2021.x.x version that the error appears and the network starts crumbling within 4 hours at most.

It’s a real (frustrating) head-scratcher for me, and I am now at a loss on what else to do to troubleshoot. I scoured my log files running debug on zigpy (https://pastebin.ubuntu.com/p/Q5FKJbgn3N/) but cannot discern from them what leads to this error. I keep hoping with every new update that it will go back to how it once was, but I’m starting to feel that I should not have moved from Deconz to ZHA.

For what it’s worth, this was the GitHub issue I opened when I first ran into the problem: https://github.com/home-assistant/core/issues/44612#issuecomment-753297168

In the meantime, I have cleared out the db, limited the items that are being recorded and disabled my UPnP integration with the router, which was very disk-heavy. It has been running stable for a while now. I also didn’t run into this issue at boot; it usually ran fine for a day or so and then just timed out after a while.

Maybe the GH issue above can provide some extra information.

Thanks!

I had not seen that thread before; it seems you are way further down this track than I am.

Just some further info: my MariaDB instance is <100 MB, and looking at the tables, none seem excessively large (I have also already limited the most obvious contenders):

recorder:
  # The number of days of history to keep in the recorder database
  purge_keep_days: 1
  db_url: !secret maria_db
  exclude:
    entities:
      - sensor.time
      - sun.sun
    entity_globs:
      - sensor.date*
      - sensor.*rx*
      - sensor.*tx*
      - device_tracker.*_default
      - switch.*_sync_*
      - sensor.*estimated_current
      - media_player.plex*
  include:
    entities:
      - device_tracker.arcade

The next largest table comes from the system monitor, but I would really like to keep that historic data for troubleshooting, so I think I have done all I can do here.

I also had InfluxDB running on the ODROID (with a filter config similar to the recorder’s); I am now seeing whether disabling it improves performance.
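
For context, the InfluxDB block was shaped roughly like the recorder filter above; a sketch only, the host secret and entity globs here are illustrative rather than my exact config:

influxdb:
  host: !secret influxdb_host
  exclude:
    entity_globs:
      - sensor.*rx*
      - sensor.*tx*
      - sensor.date*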

EDIT: typo

Your DB size definitely isn’t the problem then. In my case, I think it was more the constant reading from and writing to the DB that got me into trouble.

Are you seeing any of these messages? And if so, how long are the reported delays?

DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=1.797550264003803>
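
They are debug-level events, so they will only show up if homeassistant.core is logged at debug; a minimal sketch of a logger block that surfaces them (assuming configuration.yaml, adjust to your setup):

logger:
  default: warning
  logs:
    homeassistant.core: debug   # needed to see the Bus:Handling / timer_out_of_sync events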

First off, thanks for taking the time and effort to help.

but, yea:

2021-01-18 15:03:06 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=9.258427138000116>
2021-01-18 15:04:41 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.271723421999923>
2021-01-18 15:04:46 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.3299342959981>
2021-01-18 15:06:46 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.294285247000516>
2021-01-18 15:06:51 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.455457142004889>
2021-01-18 15:09:21 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.311330643999099>
2021-01-18 15:09:26 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.510349359999964>
2021-01-18 15:12:26 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.323786789998849>
2021-01-18 15:12:31 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.553714867997769>
2021-01-18 15:15:31 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.331893601000047>
2021-01-18 15:15:36 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.5901972350002325>
2021-01-18 15:18:36 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.350021759997617>
2021-01-18 15:18:41 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.625918940000702>
2021-01-18 15:21:41 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.367913561000023>
2021-01-18 15:21:46 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.6412593009990815>
2021-01-18 15:24:46 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.381754918998922>
2021-01-18 15:24:51 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=4.655054709997785>

So yes, I am getting these too… not as extreme as the 40+ second ones you were experiencing, but it might get worse over time. I thought the whole leg up of the ODROID was supposed to be its more capable I/O compared to the Pi 4. My biggest integration (as in most devices) is also my router, via the UniFi integration, but I’m somewhat hesitant to remove that one as much of my presence detection relies on it. I also removed my CCs as I was not using the integration, but I gathered from your GitHub issue that that might not have been the source of the problem for you.

So, if I understand correctly, I will have to go and hunt for the cause of these delays, as they are likely what triggers the errors that bring ZHA to a halt?

I checked again just now and I am still seeing those timer_out_of_sync messages, up to 18 seconds. They usually seem to coincide with a Chromecast losing connection or something like that. But my ZHA instance is working fine, so I’m not sure if it’s related. I’d post in the GitHub issue and see what the people there think. It might be an issue with ZHA after all?

Thanks,

Just removed some stale integrations and custom components…
I will see if that improves matters; if not, I will post to the GitHub issue. From some current chatter on the Discord I am also picking up that it might just be a ConBee II + ZHA thing, as it appears this might not be the most stable of combinations.

For what it’s worth (and with limited testing so far), it appears that removing the Synology DSM integration has put an end to the ‘timer_out_of_sync’ messages in my case. It has only been about 30 minutes, but those messages kept appearing right up until the removal of the integration. The jury is still out on whether this stabilises ZHA.

Update: now, 20 hours later, not a single ‘timer_out_of_sync’ message, and Zigbee is rock solid again. Thanks for the troubleshooting help. It never occurred to me to look outside of ZHA, but clearly the Synology DSM integration was crippling my device, and as a result Zigbee kept being the first to suffer the consequences.

To be honest, I do think something changed in ZHA that allows a timeout to take the whole integration down; I never ran into trouble before the last few versions, and neither did you as far as I can tell. Regardless, I’m glad this fixed it for you. I should probably take another look at my own setup, since I’m still getting those timer_out_of_sync notifications while you’re not getting a single one… Maybe moving the whole thing to an SSD will help… Thanks for the feedback!