Hi everyone,
I’ve been hitting a strange issue with my HA installation recently and wanted to post about it before debugging further, in case someone else has come across it. I’m sure it’s something in my configuration/setup, since otherwise it would be a widespread problem, but it’s proving hard to isolate.
Every day or so, HA starts to fail because it has hit the file descriptor limit (currently set to 1024 on my system). The cause seems to be a large number of TCP socket connections to and from ephemeral ports on localhost, for example:
tcp 0 0 localhost:36072 localhost:46349 ESTABLISHED hass 427633 31009/python3.8
tcp 0 0 localhost:41709 localhost:60174 ESTABLISHED hass 352368 31009/python3.8
tcp 0 0 localhost:41516 localhost:39049 ESTABLISHED hass 357908 31009/python3.8
tcp 0 0 localhost:40408 localhost:46625 ESTABLISHED hass 444383 31009/python3.8
I’ve spent some time debugging with strace and wireshark, and it looks like nothing is sent over these connections beyond the usual TCP handshake. The fact that both ends are ephemeral ports also seems unusual and probably wrong.
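One check that might narrow this down is whether both ends of each connection belong to the same HA process, i.e. whether every entry has a mirrored entry with local and remote swapped. Below is a minimal sketch, assuming input lines in the same column layout as the output above. The names `parse_connections` and `find_self_pairs` are made up for illustration.

```python
from collections import namedtuple

Conn = namedtuple("Conn", "local remote state pid")


def parse_connections(text):
    """Parse lines laid out like the listing above:
    proto recv-q send-q local remote state user inode pid/program
    Lines that do not match this layout are skipped."""
    conns = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9 or fields[0] != "tcp":
            continue
        pid = fields[8].split("/")[0]
        conns.append(Conn(fields[3], fields[4], fields[5], pid))
    return conns


def find_self_pairs(conns):
    """Return (conn, mirror) tuples where the mirrored entry
    (remote and local swapped) is owned by the same PID."""
    by_endpoints = {(c.local, c.remote): c for c in conns}
    pairs = []
    for c in conns:
        mirror = by_endpoints.get((c.remote, c.local))
        # c.local < c.remote keeps only one tuple per pair
        if mirror is not None and mirror.pid == c.pid and c.local < c.remote:
            pairs.append((c, mirror))
    return pairs
```

If every ESTABLISHED entry shows up in a pair, the process is connecting to itself, which would point at something inside HA opening loopback sockets rather than an external client.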
When the fd limit is hit, I do see errors like this (amongst others which are all just symptomatic of being out of file descriptors):
Mar 27 17:15:59 hostname hass[4687]: 2021-03-27 17:15:59 ERROR (MainThread) [homeassistant] Error doing job: socket.accept() out of system resource
Mar 27 17:15:59 hostname hass[4687]: Traceback (most recent call last):
Mar 27 17:15:59 hostname hass[4687]: File "/nix/store/yl69v76azrz4daiqksrhb8nnmdiqdjg9-python3-3.8.8/lib/python3.8/asyncio/selector_events.py", line 164, in _accept_connection
Mar 27 17:15:59 hostname hass[4687]: conn, addr = sock.accept()
Mar 27 17:15:59 hostname hass[4687]: File "/nix/store/yl69v76azrz4daiqksrhb8nnmdiqdjg9-python3-3.8.8/lib/python3.8/socket.py", line 292, in accept
Mar 27 17:15:59 hostname hass[4687]: fd, addr = self._accept()
Mar 27 17:15:59 hostname hass[4687]: OSError: [Errno 24] Too many open files
It’s possible that this accept call is where the leak comes from, or it could be something unrelated, but since it’s async it’s not easy to trace back to where the accept was initiated. I’ve tried attaching VS Code remotely, but it won’t let me set function breakpoints, so it’s of limited help.
Other things I’ve tried:
- Disabling integrations one by one
- Setting debug logging as the default and eyeballing logs for connections to localhost
I’m running HA 2021.3.4 on Linux (NixOS) x86_64, not using Docker. Unfortunately, I’ve been changing my installation quite a bit recently, and since the problem only manifests after HA has been running for a long time, it’s hard to say when this started happening. It definitely also occurred with 2021.3.3.
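Because the problem takes so long to show up, sampling the open file descriptor count over time could help line up the growth with log entries or with the moment an integration was disabled. Below is a minimal sketch, assuming a Linux `/proc` filesystem and that the script runs as the same user as HA (or as root). The default limit of 1024 is the value from my system, and the function names are made up for illustration.

```python
import os
import time


def count_open_fds(pid):
    """Count entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def format_sample(timestamp, fd_count, limit):
    """Render one log line showing how close the process is to its fd limit."""
    percent = 100.0 * fd_count / limit
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    return f"{stamp} open_fds={fd_count} limit={limit} ({percent:.1f}%)"


def watch(pid, limit=1024, interval=60.0, samples=None):
    """Print one sample every `interval` seconds.
    Runs until interrupted when `samples` is None."""
    taken = 0
    while samples is None or taken < samples:
        print(format_sample(time.time(), count_open_fds(pid), limit), flush=True)
        taken += 1
        if samples is None or taken < samples:
            time.sleep(interval)
```

Calling `watch(31009)` with the PID from the listing above and redirecting the output to a file would give a timeline to compare against the HA log.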
The error is similar to the one in the thread “HA Crashing with OSError: [Errno 24] Too many open files”, but in that case the destination address is consistent, whereas in my case it is not.
If anyone has any advice on how best to debug further it would be appreciated.
Thanks in advance!