Recently I purchased a HA Yellow (PoE) and paired it with a CM5 (Wireless/16GB/64GB). Originally I installed HA on the CM5 by following this method, which eventually worked the way I’d expect. With HA up and running, I started the tedious process of migrating my old HA instance to this new one, copying over scripts, automations, scenes, etc.
Eventually I had almost everything copied over and began using it as my primary instance.
A few days after that I started to notice what looked like I/O wait lag. The system would randomly hang for no apparent reason. I checked iostat, and there’s basically 0 I/O wait. Even htop shows the system is mostly idle 99% of the time.
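For anyone who wants to cross-check the I/O wait figure without iostat, here is a minimal Python sketch. It assumes the standard Linux /proc/stat field order for the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice). The function names are made up for illustration.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]


def iowait_percent(before, after):
    """Percentage of elapsed CPU ticks spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Only the first eight counters are summed: guest and guest_nice
    # are already counted inside user and nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


def read_cpu_ticks(path="/proc/stat"):
    """Read the current aggregate CPU counters (first line of /proc/stat)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


def measure(interval=5.0):
    """Sample twice, `interval` seconds apart, and return the iowait percentage."""
    first = read_cpu_ticks()
    time.sleep(interval)
    second = read_cpu_ticks()
    return iowait_percent(first, second)
```

Calling measure() while the UI is hanging, and again while it is responsive, would show whether the stalls line up with any I/O wait at all.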
I then installed a brand new Crucial 500GB M.2 SSD; the system detected it, and I moved my data disk to the SSD. I did this because I wanted to run off the larger SSD and preserve the lifespan of the eMMC. Even after moving to this disk, the odd lag remained.
It’s been a week, and navigating between pages seems faster with the SSD than it did with the eMMC, though graph data for sensors takes ~5-10 sec to load. When I open the settings on an automation, sometimes it comes right up, other times it is completely blank, and sometimes it doesn’t respond at all. Pressing a hot-key in the UI is delayed by a good ~5 sec, if it detects the key press at all. Both the browser and mobile apps regularly say “Connection lost, reconnecting…”. Even a simple action like turning off a switch is delayed. It’s very odd: sometimes it’s snappy and quick, other times the page just times out and fails to load at all.
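To put numbers on the intermittent delays, something like the following rough Python sketch could be run from another machine on the same network. It assumes the HA frontend answers a plain HTTP GET on port 8123. The function names, the 1-second "slow" threshold, and the sample count are arbitrary choices for illustration.

```python
import http.client
import time
import urllib.request


def time_request(url, timeout=10.0):
    """Return the seconds taken to fetch url, or None if the request failed or timed out."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
    return time.monotonic() - start


def summarize(samples, slow_after=1.0):
    """Count failed and slow requests and report the worst successful latency."""
    ok = [s for s in samples if s is not None]
    return {
        "total": len(samples),
        "failed": len(samples) - len(ok),
        "slow": sum(1 for s in ok if s > slow_after),
        "worst": max(ok) if ok else None,
    }


def probe(url, count=60, pause=1.0):
    """Fetch url `count` times, pausing between requests, and summarize the results."""
    samples = []
    for _ in range(count):
        samples.append(time_request(url))
        time.sleep(pause)
    return summarize(samples)
```

For example, probe("http://IP:8123/") with the Yellow's address in place of IP would report how many of 60 requests failed or took longer than a second, which helps separate a slow backend from a flaky network path.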
What I’ve tried…
Disabled the Nginx Proxy Manager add-on, so I’m going straight to IP:8123
Disabled any and all unnecessary add-ons
Disabled all non-crucial integrations
Validated that all my migrated automations, scripts, scenes, etc. point to devices/entities that actually exist
Rebooted many times
More info: the Yellow is plugged directly into a UniFi PoE switch with a 1’ Cat6e cable; power and network connectivity (from the UniFi standpoint) seem to be constant and stable. On average it’s running at about 18% CPU usage and about 135°F.
I’m showing 235 devices (granted, 98% of those are not physical), 1,890 entities, and around 74 integrations.
A little background: I’ve been running HA for about 9 years, and have run it on everything from an RPi 3, ODROID, Docker, VM, RPi 4, etc. This CM5/SSD setup has the highest specs, and by all accounts it should be running the best, yet it’s the worst performance I’ve ever seen.
I’m running out of ideas on what to troubleshoot next.
Happy to provide redacted configs, screenshots, whatever. Would really like to get to the bottom of the issue and get it resolved. Thank you!