Strange behaviour with nginx SSL Terminator

I’m seeing some strange behaviour with nginx and Home Assistant lately. The problems started when I migrated to HassOS, but after that I was also able to reproduce it on my previous setup (though I also updated HA there, more details later on).

I made a video that showcases the error:

(In case the forum doesn’t allow me to embed videos: https://youtu.be/khdQgJ3E1Yw )

After the above scenario, almost nothing loads anymore and I have to F5 the page before everything starts working again. In the companion apps I have to force-close the app and start it up again.

My setup looks like:

  • Proxmox v8.1.3 running on a HP 800 G3 mini with 16GB of RAM
  • Currently an LXC running dockerized Home Assistant, Core 2023.12.3, Frontend 20231208.2; assigned 4 cores and 4GB of RAM, currently using about 700MB
  • Also happens with a HassOS LXC
  • Now also happens on my previous setup: a bare-metal Ubuntu machine running dockerized HA
  • The latest version I recall not having this issue was probably 2023.10 or 2023.11

My docker-compose.yml file looks like the following. I have not tried to go back to 2023.11 because I don’t know if I can.

version: "3"
services:
  homeassistant:
    container_name: homeassistant
    image: ghcr.io/home-assistant/home-assistant:stable
    restart: unless-stopped
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./config:/config
    network_mode: "host"

Relevant configuration.yaml section:

# Configure a default setup of Home Assistant (frontend, api, etc)
default_config:

homeassistant:
  external_url: https://haxxxxxxx.example.com
  internal_url: https://ha.home.example.com
  country: NL
  allowlist_external_dirs:
    - "/config/camera-snapshots/"

http:
  use_x_forwarded_for: true
  trusted_proxies:
    - 127.0.0.1
    - 172.16.10.21

I have an SSL terminator running on 172.16.10.21 which requests a wildcard certificate for all of my internal stuff. It is configured as follows (showing only the configuration for internal_url; the external_url server block looks pretty much the same, except that it restricts the location to location ~ ^/(auth/token|api/(google_assistant|webhook/.*))$ { instead of location / and obviously changes the host. I haven’t seen any errors there; see the sketch after the config below):

server {
    server_name ha.home.example.com;
    listen 80;
    listen [::]:80;
    return 301 https://$host$request_uri;
}

server {
    server_name ha.home.example.com;
    listen 443 ssl;
    root /dev/null;
    access_log /var/log/nginx/ha.access.log;
    error_log  /var/log/nginx/ha.error.log;

    ssl_certificate /etc/letsencrypt/live/home.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/home.example.com/privkey.pem;
    # disable poodle attack (sslv3)
    ssl_protocols TLSv1.3 TLSv1 TLSv1.1 TLSv1.2;
    ssl_prefer_server_ciphers on;
    ssl_ciphers EECDH+AESGCM:EDH+AESGCM;
    ssl_ecdh_curve secp384r1;
    ssl_session_timeout 1d;
    ssl_session_cache shared:SSL:50m;
    ssl_session_tickets off;
    ssl_stapling on;
    ssl_stapling_verify on;

    proxy_buffering off;

    location / {
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Host $http_host;
        proxy_set_header X-NginX-Proxy true;

        proxy_pass http://172.16.50.55:8123;
        proxy_redirect off;

        # Socket.IO Support
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
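
For completeness, the location part of the external_url server block looks roughly like this (just a sketch based on the regex quoted above; the rest of that server block is the same apart from the server_name):

    location ~ ^/(auth/token|api/(google_assistant|webhook/.*))$ {
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Host $http_host;

        proxy_pass http://172.16.50.55:8123;
        proxy_redirect off;
    }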

NOTE: There might be some outdated stuff in there, but I’ve been running this exact same setup for 3+ years.

Might be worth mentioning: I use MySQL as the recorder/history backend, which sometimes takes a good amount of time to load, but after an F5 it loads instantly. I also run an external MQTT server and an external zigbee2mqtt Docker instance.

I see no errors in my nginx or Home Assistant error logs, only some logging in the Firefox console, which is visible in the video.

I don’t know where to look for errors anymore or what to try next. Can I maybe simply roll back to 2023.11 and try that, or does the database also mark a migration internally?
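
If a rollback is possible, I guess pinning the image in docker-compose.yml would look something like this (just a sketch; the exact 2023.11.x tag name on ghcr.io is an assumption on my part, and I still don’t know whether the recorder database would accept going back):

services:
  homeassistant:
    # pin an older release instead of :stable (exact tag name not verified)
    image: ghcr.io/home-assistant/home-assistant:2023.11.3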

Oh, and probably the most important thing, which I forgot to mention before: if I visit HA directly on its IP then I don’t see this strange behaviour; it only happens when I go through the SSL terminator.

I disabled ipv6 on that LXC by changing the following in sysctl:

root@home-assistant:~# cat /etc/sysctl.d/local.conf 
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1

Confirmed with ip a that no IPv6 addresses are assigned.
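
(For reference: files in /etc/sysctl.d are applied at boot; if I remember correctly, something like the following re-applies them without a reboot.)

root@home-assistant:~# sysctl --system   # re-applies everything under /etc/sysctl.d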

Greetings.

Ok, you have already determined that the problem only occurs when going through the proxy. Back in the day, when I worked in secure Mobile Device Management, proxies were the usual suspects in problems like this. A few troubleshooting ideas:

  • What about disabling TLSv1 and TLSv1.1? Likely unrelated, but who knows; those versions are creaky already anyway (see the snippet after this list).
  • What happens if you go through the proxy but over HTTP?
  • What happens if you load the cert directly in the HA host and go to it over HTTPS?
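
For the first point, that would just mean trimming the ssl_protocols line to something like:

    # only modern TLS versions, TLSv1/TLSv1.1 dropped
    ssl_protocols TLSv1.2 TLSv1.3;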

Thanks!

So I found some time to test a few things out and here are my findings:

  • Disabling TLSv1 & 1.1: these have been deprecated for years anyway, so I disabled them. As suspected, no change in behaviour.
  • Proxying over HTTP instead of HTTPS: I got the exact same behaviour. After a few clicks here and there, I get the following in the Firefox console log:

So the issue has nothing to do with the certificate, but rather with the proxy itself.

  • Load the certificate directly in HA: I did this and found no issues. No timeouts, no problems at all. I did this by setting the following in my configuration.yaml file (currently using the dockerized HA):
http:
  ssl_certificate: /config/fullchain.pem
  ssl_key: /config/privkey.pem
  use_x_forwarded_for: true
  trusted_proxies:
    - 127.0.0.1
    - 172.16.10.21

(I left the trusted_proxies part in there because I only tested this from my own PC, overriding the host in my /etc/hosts file; I didn’t want to change a whole lot of NAT and DNS configuration just to test something out.)
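
For reference, the /etc/hosts override on my PC was just a single line pointing the internal hostname straight at the HA host instead of at the proxy, something like:

172.16.50.55    ha.home.example.com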

So it seems the issue is mainly within nginx then. Question is: what? And… I can’t be the only one affected by this, can I? What am I missing here?

Ok, now you have further confirmation of where to look.

Many folks here use nginx but I’ve never seen that problem. Could it be related to your config? Maybe you can test with the simplest nginx config possible. Also, is the proxy up to date?

I did some more debugging and found that the culprit is actually the /api/websocket endpoint.

When not proxying, it keeps spitting out responses for about 12 minutes or so in the background. When proxying, the data stream dies after about 36-39 seconds, without the browser knowing that the proxy closed the connection or that something went wrong with it in the background (which sadly isn’t logged anywhere :confused: ).

It takes quite some time for the browser to figure out that the connection is no longer available, hence the long-spinning loading indicator. When it eventually does figure it out, it closes the connection on the browser’s side with a (generic) 1006 error, logs a ton of errors about uncaught promises (see my post above) and opens a new websocket which will, again, work for about 36-39 seconds. All functionality is then restored, except the changes I made in the meantime: those are not saved and I have to reapply them. When working on an automation, for example, I can’t re-save it because the save button only appears when there are unsaved changes; since I’ve already ‘saved’ those changes, I can’t re-apply them. After a refresh, I now know I have about 30 seconds to apply my changes before the socket hangs again.

I’m using nginx 1.22.1 on Debian 12, which is the latest version in the repo for this particular distro. I did some cleanup and my current configuration looks like the following:

    proxy_buffering off;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Host $http_host;

    location /api/websocket {
        # Socket.IO Support
        #proxy_http_version 1.1;
        #proxy_read_timeout 86400s;
        #proxy_send_timeout 86400s;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_pass http://172.16.50.55:8123;
    }

    location / {
        proxy_redirect default;
        proxy_pass http://172.16.50.55:8123;
    }

I separated out the websocket part mainly to make it easier to test things on that particular endpoint only.

I commented out proxy_http_version (if it wants to upgrade to HTTP/2, be my guest! Also: the issue is present as well when restricting the communication to HTTP/1.1), tried proxy_read_timeout and proxy_send_timeout without success, and the rest is just so that HA knows where a request comes from. The Upgrade and Connection headers are set explicitly, according to nginx’s documentation: Using NGINX as a WebSocket Proxy
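
For what it’s worth, the pattern from that documentation page uses a map so the Connection header is only set to “upgrade” when the client actually requests one; a variant of my websocket location following that pattern would look roughly like this (a sketch, not what I’m currently running):

map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

server {
    ...

    location /api/websocket {
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_pass http://172.16.50.55:8123;
    }
}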


With all that being said… I saw that there were some changes related to this in 2023.12, and I believe one of the following might have been the one:

In the first one I saw the following comment, which looks an awful lot like my issue as well:

mysql test failure is unrelated. still don’t know why that times out.

But that issue is closed to contributors only. This is probably very rare to see, but I believe this to be a bug under some rare circumstance or combination of libraries I happen to be using.

I will test out if the same happens on a totally clean HA installation tonight.

Update time: I created a clean install of HA, enabled only the MQTT + Chromecast stuff (to have a continuous stream of data going through the websocket) and configured nginx exactly the same way as for my “production” HA instance. This works fine: the /api/websocket endpoint just keeps providing information and doesn’t time out.

I don’t really know where to look further now. It must be some old configuration, I think? An integration that is not playing nicely, perhaps? But why only through the proxied connection? Could it be that something else is timing out? I do have two devices that, since a short while ago, also tend to disconnect from HA: a Shelly 2.5 that every 1 hour and 45 minutes shows unavailable and then immediately comes back while turning all the relays on, and an ESPHome device which disconnects at really random hours. I was going to look into this separately, but maybe it’s related?

That nginx config couldn’t get much simpler I think.

It looks like some edge case, in fact. What about nginx request logging, perhaps even in debug mode?

Also, since you are at the point of standing up other systems to troubleshoot, maybe it’s worth installing a Caddy proxy and seeing what happens through it.
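
If you go the Caddy route, a minimal Caddyfile for such a test could be as simple as something like this (assuming Caddy 2, which handles the websocket upgrade in reverse_proxy by itself, and leaving certificates to Caddy’s defaults):

ha.home.example.com {
    reverse_proxy 172.16.50.55:8123
}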

I would just like to say that Nginx Proxy Manager doesn’t even have the /api/websocket location, and it proxies the whole of Home Assistant, including the websocket, without any issues.

Same with Uptime Kuma: there is a tickbox for websockets support, and as far as we can tell, all that does is add the Upgrade/Connection lines. Uptime Kuma, for example, doesn’t work unless you tick the websockets toggle.

But there are no specific extra location sections when you tick that box.

Thanks for your answer! I originally also didn’t have or need a separate location for this particular endpoint, but it might come in handy for debugging :slight_smile: I can, for example, enable debug logging for that endpoint only, should I need it.

I guess I’ll need to do that.
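
If I read the nginx docs correctly, enabling debug logging for just that endpoint would be something along these lines (a sketch; it needs an nginx binary built with --with-debug):

    location /api/websocket {
        # separate, verbose log only for the websocket endpoint
        error_log /var/log/nginx/ha.websocket.log debug;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_pass http://172.16.50.55:8123;
    }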


Anyway, some more updates:

I decided to use my vacation time to just set up a new HA instance from scratch. My current setup was migrated all the way from my very first setup on an rpi3b and has grown throughout the years.

This new instance was working absolutely fine… BUT then it began showing the exact same behaviour! :slight_smile: So I might have a reproducible case after all.

I noticed this began happening after I enabled the SNMP device tracker platform:

device_tracker:
  - platform: snmp
    host: !secret pfsense_host
    community: !secret pfsense_community
    baseoid: 1.3.6.1.2.1.4.22.1.2
    consider_home: 120
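
(A quick way to check whether that OID answers at all would have been something like the following from the HA host, assuming net-snmp’s snmpwalk and an SNMP v2c community:)

# walk the ARP table OID used above (ipNetToMediaPhysAddress)
snmpwalk -v2c -c <community> <pfsense-ip> 1.3.6.1.2.1.4.22.1.2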

HOWEVER… this didn’t work at all, because the pfSense host is actually on another VLAN, which I had totally forgotten about in the first place. So I added a new ethernet card to the LXC, and to my surprise the problems started! I noticed the exact same behaviour I’m having with my ‘production’ HA: the /api/websocket endpoint now works for about 36-39 seconds, then no more data is transmitted and the browser (eventually) closes the connection with a 1006 error.


Conclusions:

I deleted the second card from within Proxmox, and a reload later my HA test and production instances did not fail again, so I’m pretty sure it’s related to this. The way I had it set up was:

root@home-assistant:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0@if23: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 06:28:21:9f:35:57 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.16.50.55/23 brd 172.16.51.255 scope global dynamic eth0
       valid_lft 6240sec preferred_lft 6240sec
3: eth1@if24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 22:aa:68:cc:99:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.16.10.215/24 brd 172.16.10.255 scope global dynamic eth1
       valid_lft 6344sec preferred_lft 6344sec

Whereas my docker-compose.yml file looks like:

version: "3"
services:
  homeassistant:
    container_name: homeassistant
    image: ghcr.io/home-assistant/home-assistant:stable
    restart: unless-stopped
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./config:/config
    network_mode: "host"

I did not test what happens if I map the ports manually instead of passing network_mode: "host", but that would require some investigation on my part regarding the ports and protocols needed now and in the future.
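
For the record, the explicit port-mapping variant would look roughly like this (a sketch; 8123 is the only port I’m sure about, and anything relying on multicast/discovery would need more thought):

services:
  homeassistant:
    container_name: homeassistant
    image: ghcr.io/home-assistant/home-assistant:stable
    restart: unless-stopped
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./config:/config
    ports:
      # web UI / API only; some integrations may need additional ports
      - "8123:8123"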

That being said, I have now also introduced firewall rules so that, instead of allowing HA to reach my entire management VLAN, it can only access the services it needs on that VLAN (which are quite a few, since most of them go through the exact same transparent proxy that the IOT network is not supposed to be able to reach in the first place). This is actually better this way: now HA only needs access to one network instead of two.

This did not use to be an issue before; I’ve been using two interfaces ever since netplan became the default in Ubuntu (starting with 20.10, I think). I believe the 2023.12 release made some changes and perhaps it gets confused somehow? It could also be that Docker had a bad update in that timeframe. Either way, I was able to solve my issues using only one network adapter instead of two.

Greetings.
