Please help me figure out why networking seems to regularly fail on my HA box

This is such a weird issue and I’ve never come across it before.

Basically, my HA server (HAOS on x86-64, served externally via the NGINX add-on) loses at least part of its network connectivity most days. It seems to happen in the evenings, but I could be wrong about that; I’ll need to log the outages.
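Since the timing matters here, one way to log the outages is a small probe running on another machine on the LAN. This is only a minimal sketch, assuming Python 3 is available on that machine. The hostname `homeassistant.local`, the log file name `ha_probe.log` and the 60-second interval are placeholders; only port 8123 comes from the setup described above.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False


def log_line(ok, when=None):
    """Format one log entry: a UTC timestamp followed by UP or DOWN."""
    when = when or datetime.now(timezone.utc)
    return f"{when.isoformat(timespec='seconds')} {'UP' if ok else 'DOWN'}"


def watch(host, port, interval=60.0, logfile="ha_probe.log", rounds=None):
    """Probe repeatedly and append one line per attempt to `logfile`.

    `rounds=None` means run until interrupted.
    """
    done = 0
    while rounds is None or done < rounds:
        with open(logfile, "a", encoding="utf-8") as fh:
            fh.write(log_line(probe(host, port)) + "\n")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)


# Example call (placeholder hostname; 8123 is the HA web port):
# watch("homeassistant.local", 8123)
```

Left running for a few days, the log should show whether the DOWN entries really cluster in the evenings. A TCP probe only tells you the port stopped answering, not why, so it complements the host-side logs rather than replacing them.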

Symptoms

1. HA becomes inaccessible, but seems to keep running

I can’t access it either externally (https) or internally (via port 8123). It also appears that HA stops being able to send or receive on the network.


Screenshot 1

In this image, you can see that the broken or stuck lines belong to sensors that collect their data over the network (wifi, external data source, etc.). These came back after I rebooted the host.

But you can also see the z-wave sensors that continued to work. These operate independently of the network.

2. SSH is sometimes accessible

Sometimes I’m able to access the box over SSH, but this doesn’t always happen. In the past, I remotely rebooted the box when possible and it came good.

3. CPU seems to do something odd in the hours leading up to the issue

Check out these two graphs.


Screenshot 2

You can see here that the CPU goes up just before midday and stays up like that for about eight hours. The rise coincided with my updating HA from 2025.5 to 2025.5.1, and you can see that the CPU usage stayed up after the update.

The network issues start at the point where the CPU usage drops (see Screenshot 1). It looks like something crashes?

What about the memory usage: see how it starts sawtoothing after the issue starts?

The changes at the end occur after I restarted the box.

Let’s take a closer look at CPU:


Screenshot 3

There’s definitely a spike that occurs at or just before midday. I don’t quite know what to make of it, if it’s a service that’s started to have issues or not.

Troubleshooting

1. Checking storage usage

I checked to make sure that I didn’t have issues with storage. There is plenty of free space (90% free).

2. Restoring from backup

I booted into a GParted live session to run diagnostics on the SSD and erased it completely. No issues were reported with the SSD or the memory.

I imaged the SSD and restored from backup.

3. NGINX?

I’m not sure whether NGINX could be the issue, because I did start hosting Jellyfin through it. I thought maybe that was causing problems, and I realised I hadn’t set proxy_buffering off.

I changed that setting and it still has issues.

Either way, if NGINX was the issue, presumably I would still be able to access the box over SSH all the time and also on the LAN through 8123.

4. Supervisor + Logs?

I’m thinking there might be an issue with the supervisor that could be causing these issues.

I’ve pulled some logs today from a number of components on the box and I can’t see anything weird offhand.

Thoughts?

Any suggestions would be fantastic, thanks!

Update 1

Yesterday, after I performed a power cycle, it took only a few hours for the server to degrade again. Here’s what I noticed:

  • About 30% of pings were dropped
  • SSH would work, albeit very intermittently
  • Some TCP/IP packets would go out: I could turn on a light with a z-wave button and it would mostly work, though slowly
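To put a number on the dropped pings over time, something like the following could be run from another machine on the LAN. It is a rough sketch under assumptions: it relies on the system `ping` accepting `-c` for the packet count and printing a summary containing "% packet loss" (as the common Linux and macOS versions do), and the function names are my own.

```python
import re
import subprocess

# Matches the loss figure in ping's summary line, e.g. "30% packet loss".
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_loss(ping_output):
    """Extract the packet-loss percentage from ping's summary, or None if absent."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def measure_loss(host, count=20):
    """Run the system `ping` with `-c count` and return the reported loss in percent."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
        check=False,  # ping may exit non-zero (e.g. no replies); still parse its output
    )
    return parse_loss(result.stdout)
```

Calling `measure_loss` with the server's address every few minutes and noting the result alongside the time would show whether the loss creeps up gradually or appears suddenly.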

So the next step I took was to restore a backup from 3 May, when I knew for sure I had no issues.

From here:

  • I won’t update anything for a few days to make sure everything is stable
  • I’ll perform updates one by one in case an add-on or version of HA core is the culprit
  • If all seems good, I’m going to get a backup of the borked state of HA and restore it on a VM to see if the condition is replicated on other hardware
    • If I can replicate this issue, I would be keen to work with some devs to isolate and locate it

Analysis

While it’s too soon to tell, a corrupt configuration or database could be causing significant stability issues.

Another possibility is that something might have become corrupt in an area that is inaccessible to the [non-expert] user.

If this is correct, the fault is packed into backups and will break restored instances. This makes it very difficult or impossible to recover from.

This means that in lieu of a working backup, the only solution could be a total rebuild of the server. For users trying to save time with HA, this is an unacceptable solution.

Some preliminary thoughts for ways ahead:

  • From memory, the HA database refactors periodically.
    • Perhaps this should be invokable on demand if necessary.
  • Components outside HA core aren’t easily accessed by the user.
    • If these exhibit issues, there should be a way for the user to “refresh” (i.e. reset) them from the GUI as part of a troubleshooting wizard
  • Should HA display backups as snapshot branches to assist backup-based troubleshooting? (Like in VM managers…)
  • Should HA have a built-in troubleshooting wizard? E.g.:
    • When were things definitely working properly? Comparing backups…
    • Step 1: Restore recent backup, did that work?
    • Step 2: Refresh OS disk, did that work?
    • Step 3: Reinstall all add-ons, did that work?
    • Step 4: Refactor database, did that work?
    • Step 5: Refresh supervisor, did that work?
    • Step 6: Restore old backup, with option to import automations, scripts, scenes etc from current state

I really must say, hats off. Your problem description is way better than a lot of what I read.

I am really not familiar with HAOS.

But what you want to check is the host itself.

So for instance

  • Check dmesg (the kernel ring buffer).
  • Run ifconfig or ip a. Do you see errors?
  • Check the logs in /var/log, for instance messages.
  • Do pings (ICMP) on the host.
  • Check whether another device in the subnet has the same IP. That could kick the device off the network, for instance…
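For the interface-error check, the same counters that ifconfig and ip show are exposed by the Linux kernel in /proc/net/dev. Below is a minimal sketch that picks out interfaces with non-zero error or drop counters. It assumes you can either run Python on the host or copy the file's text to another machine; I have not verified what is available on a HAOS host shell, and the function names are made up for illustration.

```python
def parse_net_dev(text):
    """Parse the text of /proc/net/dev into {interface: error/drop counters}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, values = line.split(":", 1)
        fields = [int(v) for v in values.split()]
        if len(fields) < 16:  # expect 8 receive + 8 transmit columns
            continue
        stats[name.strip()] = {
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
        }
    return stats


def interfaces_with_problems(stats):
    """Return only the interfaces whose error or drop counters are non-zero."""
    return {name: c for name, c in stats.items() if any(c.values())}


def read_net_dev(path="/proc/net/dev"):
    """Read the raw counter table from the running kernel."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()


# Example: print(interfaces_with_problems(parse_net_dev(read_net_dev())))
```

The counters are cumulative since boot, so a single non-zero value is not alarming on its own. What matters is whether they climb quickly while the box is degraded.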

In fact, there could be plenty of causes. Now it’s up to you to track it down.

I had similar issues with a USB Wifi adapter. Once I realized it was the adapter, I replaced it.