Consistent lag

Recently I purchased a HA Yellow (PoE) and paired it with a CM5 (Wireless/16GB/64GB). Originally I installed HA on the CM5 by following this method, which eventually worked the way I’d expect. With HA up and running, I started the tedious process of migrating my old HA instance to this new one, copying over scripts, automations, scenes, etc.

Eventually I had most everything copied over and began using it as my primary instance.

A few days after that I started to notice what looked like I/O wait lag. The system would appear to hang at random, for no good reason. I checked iostat and there’s basically 0 I/O wait. Even htop shows the system is mostly idle 99% of the time.

I then installed a brand new Crucial 500GB M.2 SSD; the system detected it, and I moved my data disk to the SSD. I did this because I wanted to run off of the larger SSD and preserve the lifespan of the eMMC. Even after moving to this disk, the odd lag remained.

It’s been a week, and navigating between pages seems faster with the SSD than it did with the eMMC, though graph data for sensors takes ~5-10sec to load. If I select settings on an automation, sometimes it comes right up, other times it is completely blank, and sometimes it doesn’t respond at all. Even pressing a hot-key in the UI is delayed by a good ~5sec, if it detects the hot-key press at all. Both the browser & mobile apps regularly say “Connection lost, reconnecting…”. Even running a simple action, like turning off a switch, is delayed… it’s very odd. Sometimes it’s snappy and quick, other times the page just times out and fails to load at all.

What I’ve tried…
Disabled the Nginx Proxy Manager add-on, so I’m going straight to IP:8123
Disabled any and all unnecessary add-ons
Disabled all non-crucial integrations
Validated that all my migrated automations, scripts, scenes, etc. were valid and pointed to actual devices/entities that exist
Rebooted many times

More info: the Yellow is plugged directly into a Unifi PoE switch with a 1’ cat6e cable; power and network connectivity (from the Unifi standpoint) seem to be constant and stable. On average it’s running at about 18% CPU usage and about 135ºF.
I’m showing 235 devices (granted, 98% of those are not physical), 1,890 entities, and around 74 integrations.
A little background: I’ve been running HA for about 9yrs, and have run it on everything from a RPi 3, Odroid, Docker, VM, RPi4, etc. This CM5/SSD has the highest specs, so for all intents and purposes it should be running the best, but it’s the worst performance I’ve ever seen.

I’m running out of ideas on what to troubleshoot next.
Happy to provide redacted configs, screenshots, whatever. Would really like to get to the bottom of the issue and get it resolved. Thank you!

Were you running the same integrations and the same number of devices?

Could it be network related? I have an area in my home with poor WiFi and I get lag with HA there, but I confused it for a Z-Wave lag issue. Could it be a similar problem? What is the result with a remote connection?

Is it specific to an integration type? Z-Wave only, or others too?

Definitely sounds network related

Try a longer cable. I previously read you may get reflection noise or other problems with a short cable. It may also be a defective cable or an Ethernet port issue. Maybe try another port on the switch.

Could it be that the RPi5 connects to the 5GHz WiFi and that is not that well suited in that location?

Wifi is completely unused at this point, it’s not configured or connected in any way.

@tmjpugh I’m actually using about the same number of integrations, but a lot fewer devices. At this point the only physical devices that I’m controlling are a Roborock, some Lifx bulbs, a couple Dyson fans, a couple ESPHome BT proxies, and a Tuya fan. Not currently using Z-Wave or Zigbee.

It likely is network related, though I can’t and don’t understand where, or why.
All of my physical devices connect via wifi, though I’m able to read sensor data from them without disruption. And all of the devices stay connected to wifi from what I can tell (from Unifi console), no dropped packets, no signal loss…

I can try a longer cable, though all my cables are CAT6, and shielded, and no other device has this kind of issue, and it’s all relatively close, physically.

If I were to guess, it seems like it’s network related, but on the host level, like between docker containers, which shouldn’t have any issue since they’re all bridged together internally.

I tried shielded cat6 and it caused nothing but issues. Similar to what you report. Really weird. Never bothered to figure out why, just swapped out cables.

Not sure how old the equipment is, but corrosion at the port could create issues as well.

How?
Do you have a compose file?

Unifi doesn’t really show temporary loss of connection. You could create a sensor for the network connection in HA. That may show disconnects/reconnects more accurately.
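
As a cross-check from outside HA, a small probe run from another machine on the LAN can also log short drops. The sketch below is only an illustration, not something anyone in this thread ran: it times a TCP connect to the HA port (8123, as mentioned above) and labels each probe ok/slow/down. The function names, the 1-second "slow" threshold and the 192.0.2.10 address are placeholders, so swap in your own values.

```python
import socket
import time
from typing import Optional


def tcp_connect_latency(host: str, port: int, timeout: float = 5.0) -> Optional[float]:
    """Return the seconds taken to open a TCP connection, or None if it failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None


def classify(latency: Optional[float], slow_after: float = 1.0) -> str:
    """Label a probe result; slow_after is an arbitrary threshold in seconds."""
    if latency is None:
        return "down"
    return "slow" if latency >= slow_after else "ok"


def monitor(host: str, port: int, interval: float = 2.0, samples: int = 30) -> None:
    """Probe repeatedly and print one timestamped line per probe."""
    for _ in range(samples):
        latency = tcp_connect_latency(host, port)
        print(time.strftime("%H:%M:%S"), classify(latency), latency)
        time.sleep(interval)


# Example call (hypothetical address, replace with the Yellow's IP):
# monitor("192.0.2.10", 8123)
```

If the probe stays "ok" while the UI is hanging, that would point away from the cable/switch and toward something inside HA itself.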

Try this. In the UI, click your user profile at the bottom left of the side bar. Then, in your profile page, check that Advanced mode is on - if it’s not, turn it on (hard refresh or clear your browser cache just in case).
Then, in the same profile page, scroll down to the Browser settings section - you’ll see an “Automatically close connection” option. Turn it off and your connection should no longer disconnect after 5 minutes.

@tmjpugh Trying a longer cat5e cable now, and moved the Yellow a little further away; the issue seems to remain. The main switch is maybe 5yrs old, with no other issues on any other hosts.

No, running HAOS, where HA and other add-ons are all docker containers; it’s transparent to the user, but if you log into the host via ssh, everything’s running as a container, including HA. There is no compose file.

I’ll try the sensor route, see if I can catch it that way.

@ShadowFist
Tried that on both browser & mobile, seemed to help, but issue remains.

It’s acting like there’s an IO wait issue, like it can’t access a resource when requested, and just hangs for ~5-15sec.

Here’s iostat running against the nvme (primary storage) when I’ve got a loading page in the UI.

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
nvme0n1           0.00         0.00         0.00          0          0
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
nvme0n1           0.00         0.00         0.00          0          0
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
nvme0n1           0.00         0.00         0.00          0          0
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
nvme0n1           0.00         0.00         0.00          0          0
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
nvme0n1           0.00         0.00         0.00          0          0
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
nvme0n1           3.96         0.00       926.73          0        936
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
nvme0n1           0.00         0.00         0.00          0          0
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
nvme0n1           0.00         0.00         0.00          0          0
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
nvme0n1           0.00         0.00         0.00          0          0
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
nvme0n1           0.00         0.00         0.00          0          0
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
nvme0n1           0.00         0.00         0.00          0          0
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
nvme0n1           0.00         0.00         0.00          0          0
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
nvme0n1           0.00         0.00         0.00          0          0
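
iostat here only shows per-device activity. CPU-side I/O wait can be cross-checked from /proc/stat on a Linux host by sampling the aggregate "cpu" line twice and comparing the iowait counter against the total. A minimal sketch, assuming you can run Python somewhere that sees the host's /proc/stat (HAOS itself may not offer that without an add-on); the helper names are made up for illustration.

```python
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into its first 8 counters.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    """
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:9]]


def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as handle:
        return handle.readline()


# Typical use: take read_cpu_line() once, wait through a UI hang,
# take it again, then pass both strings to iowait_percent().
```

A value near zero during a hang would agree with the iostat output above and suggest the stall is not disk-bound.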

Pinging the hass container from HAOS

764 packets transmitted, 764 packets received, 0% packet loss
round-trip min/avg/max = 0.056/0.096/0.582 ms

Disabled almost every add-on, and 52 integrations, rebooted the Yellow (not just restarted HA), cleared all caches…

Now everything seems to be running smoothly! Guess I’ll need to go through one by one and re-enable them, to see what the culprit was.

Any suggestions on order? Add-ons, integrations, integrations dependent on physical devices, cloud-based integrations, ones that do calculations (like Thermal Comfort)?

I doubt the official ones would cause this, so start with HACS or custom integrations.
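
If going one by one gets tedious, re-enabling in halves narrows it down faster. The sketch below just illustrates that bookkeeping, assuming a single culprit and that the lag shows up reliably once the culprit is enabled (neither is confirmed in this thread). `find_culprit` and `lag_with` are hypothetical names; in practice `lag_with` is you enabling only that batch, rebooting, and checking whether the lag is back.

```python
from typing import Callable, List, Sequence


def find_culprit(candidates: Sequence[str],
                 lag_with: Callable[[List[str]], bool]) -> str:
    """Halve the suspect list until one item is left.

    lag_with(batch) must return True if the lag appears with only
    that batch enabled. Assumes exactly one culprit among candidates.
    """
    suspects = list(candidates)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        if lag_with(half):
            suspects = half                   # culprit is in the enabled half
        else:
            suspects = suspects[len(half):]   # culprit is in the other half
    return suspects[0]


# Simulated run with made-up names: 52 disabled integrations, one bad.
names = [f"integration_{i}" for i in range(52)]
rounds = []
found = find_culprit(names, lambda batch: rounds.append(batch) or "integration_37" in batch)
print(found, "found after", len(rounds), "rounds")
```

For 52 disabled integrations that is at most 6 rounds of enable, reboot and check, and putting the HACS/custom ones at the front of the list follows the suggestion above.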