This is such a weird issue and I’ve never come across it before.
Basically, my HA server (HAOS, x86-64, served externally via the NGINX add-on) is losing at least part of its network connectivity most days. It seems to happen in the evenings, but I could be wrong about that; I’ll need to log the occurrences.
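For logging the outages, a rough sketch like the one below might do, run from another machine on the LAN. It only checks whether a TCP connection can be opened. The hostname and the 443/22 port numbers are assumptions on my part (only 8123 is certain), so adjust them to the actual setup.

```python
import csv
import socket
import time
from datetime import datetime, timezone

# Assumptions: this hostname and the https/ssh port numbers are placeholders.
HOST = "homeassistant.local"
PORTS = {"ha_ui": 8123, "https": 443, "ssh": 22}


def check_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def probe_once(host, ports, timeout=3.0):
    """Probe every named port once and return one timestamped row."""
    row = {"time": datetime.now(timezone.utc).isoformat(timespec="seconds")}
    for name, port in ports.items():
        row[name] = "up" if check_port(host, port, timeout) else "down"
    return row


def log_forever(path, host=HOST, ports=PORTS, interval=60):
    """Append one row per interval to a CSV file until interrupted."""
    fields = ["time", *ports]
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        if fh.tell() == 0:  # new or empty file: write the header first
            writer.writeheader()
        while True:
            writer.writerow(probe_once(host, ports))
            fh.flush()
            time.sleep(interval)
```

Calling `log_forever("ha_reachability.csv")` and leaving it running should give a timestamped record of when each port drops out, which would also show whether SSH and 8123 fail together or separately.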
Symptoms
1. HA becomes inaccessible, but seems to keep running
I can’t access it either externally (https) or internally (via port 8123). It also appears that HA stops being able to send or receive on the network.
Screenshot 1
In this image, you can see the broken or stuck lines, which are collected from sensors that use the network (wifi, external data sources, etc.). These came back after I rebooted the host.
But you can also see the z-wave sensors that continued to work. These operate independently of the network.
2. SSH is sometimes accessible
Sometimes I’m able to access the box over SSH, but this doesn’t always happen. In the past, I remotely rebooted the box when possible and it came good.
3. CPU seems to do something odd in the hours leading up to the issue
Check out these two graphs.
Screenshot 2
You can see here that the CPU usage goes up just before midday and stays up for about eight hours. The rise coincided with my updating HA from 2025.5 to 2025.5.1, and the CPU usage remained elevated after the update.
The network issues start at the point where the CPU usage drops (see Screenshot 1). It looks like something crashes?
What about the memory usage: see how it starts sawtoothing after the issue starts?
The changes at the end occur after I restarted the box.
Let’s take a closer look at CPU:
Screenshot 3
There’s definitely a spike that occurs at or just before midday. I don’t quite know what to make of it, if it’s a service that’s started to have issues or not.
Troubleshooting
1. Checking storage usage
I checked to make sure that I didn’t have issues with storage. There is plenty of free space (90% free).
2. Restoring from backup
I booted into a GParted live session, ran diagnostics on the SSD, and erased it completely. No issues were reported with the SSD or the memory.
I imaged the SSD and restored from backup.
3. NGINX?
I’m not sure whether NGINX could be the problem, since I did start hosting Jellyfin through it. I thought that might be the cause, and I realised I hadn’t set proxy_buffering off.
I changed that setting, but the issue persists.
Either way, if NGINX was the issue, presumably I would still be able to access the box over SSH all the time and also on the LAN through 8123.
4. Supervisor + Logs?
I’m thinking the Supervisor itself might be what’s causing this.
I’ve pulled some logs today from a number of components on the box and I can’t see anything weird offhand.
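To narrow down the pulled logs, a small filter like this sketch could help by keeping only warnings and errors within a window around the time the network dropped. It assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp followed by an upper-case level, which may not hold for every component's log; `lines_near` and its parameters are names made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape: "<date> <time>[.millis] <LEVEL> <rest of message>".
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(?:\.\d+)? "
    r"(?P<level>[A-Z]+) "
)


def lines_near(lines, incident, window_minutes=30,
               levels=("WARNING", "ERROR", "CRITICAL")):
    """Yield log lines of the given levels within +/- window of the incident."""
    start = incident - timedelta(minutes=window_minutes)
    end = incident + timedelta(minutes=window_minutes)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # continuation lines (e.g. tracebacks) carry no timestamp
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        if start <= ts <= end and m["level"] in levels:
            yield line.rstrip("\n")
```

Passing an open log file and the approximate outage time, for example `lines_near(open("home-assistant.log"), datetime(2025, 5, 20, 20, 0))` with the file name and date swapped for the real ones, should cut each log down to the entries worth reading.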
Thoughts?
Any suggestions would be fantastic, thanks!