HA continually opens connections to localhost causing fd leak

Hi everyone,

I’ve been hitting a strange issue with my HA installation recently and wanted to post about it before debugging further, just in case someone else has come across it. I’m sure it’s something in my configuration/setup since otherwise it would be a widespread issue, but it’s proving hard to isolate the issue.

Every day or so, HA starts to fail because it has hit the file descriptor limit (currently set to 1024 on my system). The cause seems to be a large number of TCP socket connections to and from ephemeral ports on localhost, for example:

tcp        0      0 localhost:36072         localhost:46349         ESTABLISHED hass       427633     31009/python3.8     
tcp        0      0 localhost:41709         localhost:60174         ESTABLISHED hass       352368     31009/python3.8     
tcp        0      0 localhost:41516         localhost:39049         ESTABLISHED hass       357908     31009/python3.8     
tcp        0      0 localhost:40408         localhost:46625         ESTABLISHED hass       444383     31009/python3.8     

I’ve spent some time debugging with strace and wireshark, and it looks like nothing is sent over these connections beyond the usual TCP handshake. The fact that both ends are ephemeral ports also seems unusual and probably wrong.

When the fd limit is hit, I do see errors like this (amongst others which are all just symptomatic of being out of file descriptors):

Mar 27 17:15:59 hostname hass[4687]: 2021-03-27 17:15:59 ERROR (MainThread) [homeassistant] Error doing job: socket.accept() out of system resource
Mar 27 17:15:59 hostname hass[4687]: Traceback (most recent call last):
Mar 27 17:15:59 hostname hass[4687]:   File "/nix/store/yl69v76azrz4daiqksrhb8nnmdiqdjg9-python3-3.8.8/lib/python3.8/asyncio/selector_events.py", line 164, in _accept_connection
Mar 27 17:15:59 hostname hass[4687]:     conn, addr = sock.accept()
Mar 27 17:15:59 hostname hass[4687]:   File "/nix/store/yl69v76azrz4daiqksrhb8nnmdiqdjg9-python3-3.8.8/lib/python3.8/socket.py", line 292, in accept
Mar 27 17:15:59 hostname hass[4687]:     fd, addr = self._accept()
Mar 27 17:15:59 hostname hass[4687]: OSError: [Errno 24] Too many open files

It’s possible this is the offending code that’s failing, or it could be something unrelated but since it’s async it’s not easy to track back to where the accept was initiated from. I’ve tried attaching vscode remotely, but it won’t let me set function breakpoints so it’s of limited help.

Other things I’ve tried:

  • Disabling integrations one by one
  • Setting debug logging as the default and eyeballing logs for connections to localhost

I’m running HA 2021.3.4 on Linux (NixOS) x86_64, not using docker. Unfortunately I’ve been changing my installation quite a bit recently and since the problem only manifests after HA has been running for a long time, it’s hard to say when this started happening. It definitely also occurred with 2021.3.3.

The error is similar to HA Crashing with OSError: [Errno 24] Too many open files but in that case the destination address is consistent, it’s not in my case.

If anyone has any advice on how best to debug further it would be appreciated.

Thanks in advance!

In case anyone else hits this, I think I made some progress on tracking the issue down to the ha-dyson custom component (which is still under development): https://github.com/shenxn/ha-dyson/issues/34.

Also, if anyone is debugging a similar issue, I found the Python audit events quite useful to find the problem: https://docs.python.org/3/library/audit_events.html. I just hacked a hook like this into the Home Assistant start script:

    import sys                                                                                                                                                                                                                                                                                                                                                                                                                                           
    import traceback                                                                                                                                                                                                                   
    def _my_audit_hook(event, args):                                                                                                                                                                                                   
        if event.startswith("socket."):                                                                                                                                                                                                
            msg = "----------------------------------\n"                                                                                                                                                                               
            msg += f"AUDIT {event} {args}\n"                                                                                                                                                                                           
            msg += "".join(traceback.format_stack())                                                                                                                                                                                   
            msg += "----------------------------------\n"                                                                                                                                                                              
            print(msg, file=sys.stderr)                                                                                                                                                                                                
    sys.addaudithook(_my_audit_hook)           

@gr_h_m could you elaborate on where you added this hook code into HA?

Might need something like this to troubleshoot this new issue.

(Tagging @frenck in case he has additional ideas.)

I added it to __main__ somewhere round here: core/homeassistant at 559e28143147b85a606ef76e409b14a04c6bef3f · home-assistant/core · GitHubmain.py#L155.

Here was the final audit hook I ended up with:

import sys
import socket
import traceback
def _my_audit_hook(event, args):
    if event == "socket.bind"
        sock, addr = args
        addr_host, addr_port = addr
        if sock.type == socket.SOCK_STREAM and addr_port == 0:
            msg = "----------------------------------\n"
            msg += f"AUDIT {event} {sock} {addr}\n"
            msg += "".join(traceback.format_stack())
            msg += "----------------------------------\n"
            print(msg, file=sys.stderr)
sys.addaudithook(_my_audit_hook)

Hope that helps!

Thanks!

Had to make a couple of modifications to get this to work.

You missed a “:” for the added if statement above, which prevents HA from even starting.
And even though I don’t use IPv6 it seems to still be present, and “addr” for IPv6 is a four-tuple, and this issue makes the HA GUI not come up since this hook interrupts even that.

Updated code:

import sys
import socket
import traceback
def _my_audit_hook(event, args):
    if event == "socket.bind":
        sock, addr = args
        addr_host, addr_port, *_ = addr
        if sock.type == socket.SOCK_STREAM and addr_port == 0:
            msg = "----------------------------------\n"
            msg += f"AUDIT {event} {sock} {addr}\n"
            msg += "".join(traceback.format_stack())
            msg += "----------------------------------\n"
            print(msg, file=sys.stderr)
sys.addaudithook(_my_audit_hook)

I recently encountered this issue, and was able to determine the cause by using netstat (netstat -A inet -p).

The netstat output showed that there were hundreds of UDP connections from my home assistant instance to a single LIFX bulb. A potential issue with the Lifx integration then. The only cause I can think of is that I recently replaced my ISP router, and this correlates to around when the issue started. None of my other lifx bulbs are experiencing this issue though.

It’s possible resetting or removing and re-adding the bulb could resolve this but I went fully nuclear, and just replaced the problem bulb with a spare Philips Hue bulb I had lying around.