HomeAssistant becomes unresponsive, KNX Interface errors in _SelectorDatagramTransport

Hi!

For weeks now I have had to restart my HA almost every day. It becomes unresponsive after several hours, and I see a large memory peak when it does.
The first things to fail are the KNX entities (which may be a consequence rather than the cause). Afterwards, all communication tends to fail with HTTPS and SSL errors (Tuya, Daikin, scraper, DHCP tasks, SwitchBot).

So I dug into the logs, and at every restart I can identify some expected errors:

  • TemplateErrors for unknown values at restart/crash
  • MQTT entities that have some duplicate IDs
  • Ancient config that’s still in there

The following also appears quite often, but it should not freeze my HA:

WARNING (MainThread) [xknx.log] Error: KNX bus did not respond in time (2.0 secs) to GroupValueRead request for:

But when the memory peak appears I get the following error that I cannot trace back:

ERROR (KNX Interface) [homeassistant] Error doing job: Exception in callback _SelectorDatagramTransport._read_ready()
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 1163, in _read_ready
    self._protocol.datagram_received(data, addr)
  File "/usr/local/lib/python3.11/site-packages/xknx/io/transport/udp_transport.py", line 48, in datagram_received
    self.data_received_callback(data, addr)
  File "/usr/local/lib/python3.11/site-packages/xknx/io/transport/udp_transport.py", line 96, in data_received_callback
    self.handle_knxipframe(knxipframe, HPAI(*source))
  File "/usr/local/lib/python3.11/site-packages/xknx/io/transport/ip_transport.py", line 68, in handle_knxipframe
    callback.callback(knxipframe, source, self)
  File "/usr/local/lib/python3.11/site-packages/xknx/io/tunnel.py", line 302, in _request_received
    self._tunnelling_request_received(knxipframe.body)
  File "/usr/local/lib/python3.11/site-packages/xknx/io/tunnel.py", line 513, in _tunnelling_request_received
    self._send_tunnelling_ack(
  File "/usr/local/lib/python3.11/site-packages/xknx/io/tunnel.py", line 541, in _send_tunnelling_ack
    self.transport.send(
  File "/usr/local/lib/python3.11/site-packages/xknx/io/transport/udp_transport.py", line 160, in send
    raise CommunicationError("Transport not connected")
xknx.exceptions.exception.CommunicationError: Transport not connected

The most recent logfile can be found here: home-assistant_2023-12-24T09-17-37.451Z.log - Google Drive

Hi :wave:!

That’s a result of the KNX “connection” timing out while something is still trying to send on it (kind of a race condition).
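
To illustrate that race, here is a minimal sketch in plain asyncio. `ToyTransport`, `on_tunnelling_request` and the local `CommunicationError` are hypothetical stand-ins, not the actual xknx classes; the sketch only mirrors the last frames of the traceback, where an ACK is sent after the transport has already been stopped.

```python
import asyncio


class CommunicationError(Exception):
    """Local stand-in for the xknx exception of the same name."""


class ToyTransport:
    """Hypothetical, heavily simplified transport; not the xknx class."""

    def __init__(self):
        self.connected = True
        self.sent = []

    def stop(self):
        # Called when the tunnel times out and is torn down.
        self.connected = False

    def send(self, frame):
        if not self.connected:
            raise CommunicationError("Transport not connected")
        self.sent.append(frame)


def on_tunnelling_request(transport, seq):
    # Every received tunnelling request is acknowledged right away.
    transport.send(("ACK", seq))


async def main():
    loop = asyncio.get_running_loop()
    transport = ToyTransport()
    errors = []

    def deliver(seq):
        # Simulates the event loop handing a received datagram to the handler.
        try:
            on_tunnelling_request(transport, seq)
        except CommunicationError as exc:
            errors.append(str(exc))

    loop.call_soon(deliver, 1)       # arrives while connected: ACK is sent
    loop.call_soon(transport.stop)   # timeout handling tears the transport down
    loop.call_soon(deliver, 2)       # already-queued datagram: ACK fails
    await asyncio.sleep(0)           # let the three callbacks run in order
    return transport.sent, errors


sent, errors = asyncio.run(main())
print(sent, errors)
```

The second datagram is handled after the teardown, so its ACK raises the same "Transport not connected" error as in the log. On its own that error is a symptom of the disconnect, not its cause.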

I’d try to disable custom integrations (and maybe add-ons) and see if the problem disappears. There is even a “safe mode” now that may help with that.

Other than that, there is a profiler that may help if you know how to use it.
See Profiler - Home Assistant
Or Instructions to install Py-spy on HAOS

So, this is what I did so far:

  • restarted in safe mode
    • same behavior: stall after a few hours, lots of HTTP errors on the (non-custom) integrations
  • disabled all custom integrations from HACS, also the Node-RED companion and HACS itself
  • removed all obsolete integrations
  • removed deprecated YAML from my configuration.yaml
  • cleaned up all YAML, based on the Watchman report

Even after all of the above, the system still ended up slow and unresponsive.
(The unresponsiveness showed up in: Ring, ESPHome, MQTT, Zigbee, Tuya, Android TV.)

Then I tried the opposite: I disabled the KNX integration. This resulted in a stable HA setup for everything that remained (although my setup really depends on KNX). So I think I can pinpoint it to the KNX integration.

During my investigation, I saw references to 15.15.255 in the logbook of the KNX Interface Individual address entity, an address I never configured. So I reconfigured the KNX IP interface according to the manual (it’s a Weinzierl 730 disguised as an eelectron IN00A02IPI).
This is my KNX setup:

Device address: 1.1.255 (address within ETS topology)
Connection 1:    1.1.250 (address within local settings)
Connection 2:    1.1.251 (assigned by learn key)
Connection 3:    1.1.252 (assigned by learn key)
Connection 4:    1.1.253 (assigned by learn key)
Connection 5:    1.1.254 (assigned by learn key)

Currently I see behavior in the logbook that makes sense, in that only the configured addresses show up:

KNX Interface Individual address changed to 1.1.250
12:16:59 - 13 minutes ago
KNX Interface Individual address became unavailable
12:16:26 - 13 minutes ago
KNX Interface Individual address changed to 1.1.252
12:16:24 - 13 minutes ago
KNX Interface Individual address became unavailable
12:16:20 - 13 minutes ago
KNX Interface Individual address changed to 1.1.251
12:16:18 - 14 minutes ago
KNX Interface Individual address became unavailable
12:16:14 - 14 minutes ago
KNX Interface Individual address changed to 1.1.254
12:16:10 - 14 minutes ago
KNX Interface Individual address became unavailable
12:16:10 - 14 minutes ago
KNX Interface Individual address changed to 1.1.253
12:16:09 - 14 minutes ago
KNX Interface Individual address became unavailable
12:15:58 - 14 minutes ago
KNX Interface Individual address changed to 1.1.251
12:15:50 - 14 minutes ago
KNX Interface Individual address changed to 1.1.250
12:15:50 - 14 minutes ago
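
To put a number on this flapping, each logbook message can be paired with the timestamp line below it. This is a rough sketch, assuming the logbook text is copied exactly as shown above (a message line followed by an "HH:MM:SS - ..." line); `parse_logbook` and `summarize` are names made up for this example.

```python
from datetime import datetime

LOGBOOK = """\
KNX Interface Individual address changed to 1.1.250
12:16:59 - 13 minutes ago
KNX Interface Individual address became unavailable
12:16:26 - 13 minutes ago
KNX Interface Individual address changed to 1.1.252
12:16:24 - 13 minutes ago
KNX Interface Individual address became unavailable
12:16:20 - 13 minutes ago
KNX Interface Individual address changed to 1.1.251
12:16:18 - 14 minutes ago
KNX Interface Individual address became unavailable
12:16:14 - 14 minutes ago
KNX Interface Individual address changed to 1.1.254
12:16:10 - 14 minutes ago
KNX Interface Individual address became unavailable
12:16:10 - 14 minutes ago
KNX Interface Individual address changed to 1.1.253
12:16:09 - 14 minutes ago
KNX Interface Individual address became unavailable
12:15:58 - 14 minutes ago
KNX Interface Individual address changed to 1.1.251
12:15:50 - 14 minutes ago
KNX Interface Individual address changed to 1.1.250
12:15:50 - 14 minutes ago
"""


def parse_logbook(text):
    """Pair each message line with the timestamp line that follows it."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    events = []
    for message, stamp in zip(lines[0::2], lines[1::2]):
        clock = stamp.split(" - ")[0]  # "12:16:59 - 13 minutes ago" -> "12:16:59"
        events.append((datetime.strptime(clock, "%H:%M:%S"), message))
    return sorted(events, key=lambda event: event[0])  # oldest first


def summarize(events):
    """Count disconnects and address changes and measure the time span."""
    drops = [msg for _, msg in events if msg.endswith("became unavailable")]
    addresses = [msg.rsplit(" ", 1)[1] for _, msg in events if " changed to " in msg]
    return {
        "drops": len(drops),
        "address_changes": len(addresses),
        "distinct_addresses": sorted(set(addresses)),
        "span_seconds": (events[-1][0] - events[0][0]).total_seconds(),
    }


summary = summarize(parse_logbook(LOGBOOK))
print(summary)
```

For the excerpt above this reports 5 drops and 7 address changes, cycling through all five tunnel addresses within 69 seconds.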

So now I’m really stuck. (I didn’t get to the profiler so far…)

Thanks for looking into this…

Aha. So you get constant KNX disconnections. Does your IP interface provide any kind of logs?

Other than that, I guess you can observe the communication with Wireshark. Or check / change the cabling between the interface and HA.
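
Besides a packet capture, a quick reachability check from another machine can show whether the interface still answers at all. This is only a sketch, assuming the interface listens on the standard KNXnet/IP UDP port 3671 and answers a DESCRIPTION_REQUEST; the function names and the example IP address are made up for illustration, and the script is not part of xknx or Home Assistant.

```python
import socket
import struct

# KNXnet/IP header: header length, protocol version, service type, total length
KNXNETIP_HEADER = struct.Struct("!BBHH")
DESCRIPTION_REQUEST = 0x0203
DESCRIPTION_RESPONSE = 0x0204


def build_description_request(local_ip="0.0.0.0", local_port=0):
    """Build a DESCRIPTION_REQUEST frame.

    An HPAI of 0.0.0.0:0 asks the device to reply to the sender's address.
    """
    # HPAI: structure length 8, host protocol 0x01 (IPv4 UDP), IP, port
    hpai = struct.pack("!BB4sH", 0x08, 0x01, socket.inet_aton(local_ip), local_port)
    total_length = KNXNETIP_HEADER.size + len(hpai)
    header = KNXNETIP_HEADER.pack(0x06, 0x10, DESCRIPTION_REQUEST, total_length)
    return header + hpai


def parse_header(data):
    """Return (service type, total length) from a KNXnet/IP frame."""
    _header_len, _version, service, total = KNXNETIP_HEADER.unpack_from(data)
    return service, total


def probe(gateway_ip, port=3671, timeout=2.0):
    """Return True if the device answers with a DESCRIPTION_RESPONSE in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_description_request(), (gateway_ip, port))
        try:
            data, _addr = sock.recvfrom(1024)
        except socket.timeout:
            return False
    return parse_header(data)[0] == DESCRIPTION_RESPONSE


frame = build_description_request()
print(frame.hex())
```

Calling something like `probe("192.0.2.10")` (replace with the interface's real IP) in a loop while HA is stalling would show whether the interface itself stops responding or only the tunnel connections drop.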

But I can’t think of why this would lead to an unresponsive system … :person_shrugging: