Having to restart docker container daily?

pembo · March 25, 2021, 7:47pm

Since moving from 2021.2.x to 2021.3.4 I’m having to restart the docker container pretty much once a day as it has issues accessing network resources.

I haven’t yet managed to pin down what is causing the issue, but I suspect it’s related to some sort of hanging comms that’s causing file descriptors to be consumed.
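One way to test the file descriptor theory is to watch the open descriptor count of the Home Assistant process over time. The following is only a rough sketch, assuming a Linux `/proc` filesystem is visible inside the container; the function names are made up for illustration and are not part of Home Assistant.

```python
import os
import resource


def count_open_fds(pid="self"):
    """Count open file descriptors for a process by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_soft_limit():
    """Return the soft limit on open files for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    # Pass the PID of the main python3 process instead of "self" to inspect HA.
    print(f"open fds: {count_open_fds()} (soft limit for this process: {fd_soft_limit()})")
```

If the count keeps climbing between restarts and approaches the limit, that would support the leak theory; a flat count would point elsewhere.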

When I look at the container right now, htop lists about 40-50 python3 child processes.

Is this normal?

The only issue I can spot is the continued failure of the Waze travel time integration.

I’m going to disable the Waze integration and watch to see if this makes any difference.

Has anybody else had similar issues since moving to 2021.3.x?
The only change I had to make when upgrading was the AsusWRT changes…

code-in-progress · March 26, 2021, 1:35am

I’ve seen quite a few network “dropouts” since upgrading to 2021.3.4. At least once or twice a day I get error messages that HA can’t reach a networked device for seemingly no reason at all, and then it resolves itself within about 5-10 minutes.

Meanwhile, Zabbix doesn’t report any unavailable ping sensors. So, I’m fairly convinced it’s something within either HA itself or the Core docker container. I’ve not had a chance to really sit down and troubleshoot it though and I haven’t had to restart the container to get networking back up again.

pembo · March 26, 2021, 6:47am

Glad it’s not just me!

I took a look at the log, and I also see random networking dropouts throughout yesterday for no apparent reason, so it has to be something in HA Core or the image that has changed since 2021.2.x. These only started when I switched to the 2021.3.x release, and other containers on the server don’t seem to exhibit the problem.

Nagios shows no dropouts on the ping sensors either, for HA or for the other containers, and the HA container doesn’t show any unhealthy messages.

I’ve just run an upgrade on the Linux distro and will keep watching to see if that makes any difference, and I’ll try to dig more if it gets into a mess again at some point today.

Troon · March 26, 2021, 7:29am

Just as a data point and not as a boast, I’m on 2021.3.4 on my Synology NAS, and have had no such problems. I run it in host networking mode, with ten ESPHome devices and a Yamaha AV receiver as the main network traffic.

richieframe · March 26, 2021, 8:30am

Mine has 29 child processes. However, I do not know what contributes to the count, so 50 may be totally normal if you have multiple integrations, and maybe some use more than others.

I am on 2021.3.4 and have had no issues
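On the child-process counts: htop lists userland threads as separate rows by default, so some of those python3 entries may be threads of a single process rather than real child processes. A small sketch for checking the thread count directly, assuming Linux `/proc` is available; `thread_count` is an illustrative name, not an HA API.

```python
import os


def thread_count(pid="self"):
    """Read the 'Threads:' field from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError("no 'Threads:' field found")


if __name__ == "__main__":
    # Replace os.getpid() with the PID of the main python3 process to inspect HA.
    print(f"threads in PID {os.getpid()}: {thread_count(os.getpid())}")
```

Comparing this number across a few days would show whether the count is stable or growing.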

pembo · March 26, 2021, 10:44am

Interesting… then I wonder if it’s related to one of the integrations I’m using that you aren’t. I’m keeping an eye on it to see if/when it happens again, and assuming it does (which I think it will!) I might have to turn up the logging and wait again.

Mine’s running on an Intel NUC (on top of a VM). Resources definitely aren’t an issue here, but something’s definitely changed since the previous release that’s introduced this - it’s just very difficult to pinpoint what exactly.

code-in-progress · March 26, 2021, 11:06am

I have a lot more than that. I have over 90 networked devices connected to my instance with a mix of UDP and TCP devices. I mean, yeah, it could be the number of devices I have for sure, but earlier versions didn’t really have this issue.

These are my current stats. What strikes me as odd is that HA is the only container not showing any network stats and, of course, it has the highest number of PIDs in use.

pembo · March 26, 2021, 11:40am

Mine is similar… equally, I have quite a lot of devices as well, but previously this was all fine too!

pembo · March 26, 2021, 11:45am

@code-in-progress don’t suppose you’re an ASUSWRT user?
That was the only real change for me in this release…

code-in-progress · March 26, 2021, 11:53am

Nope. Unifi but I’m also on HA Core. So, my resource usage is a lot lower (no “add-ons”). I’m actually wondering if there’s something wonky in the docker config in the HA container (for both Supervised and Core) as I’ve never seen a docker container NOT report network usage.

I know it’s not the machine I’m running HA on, as my resources are basically untouched and I have quite a few things running on this machine.

pembo · March 26, 2021, 11:56am

Rules out asuswrt then… I did a temporary apk update/upgrade on the container earlier and I’m watching to see if that makes any difference in case there’s something unusual happening on the debian side… but I’d be surprised also if that was it!

code-in-progress · March 26, 2021, 12:09pm

I wouldn’t. Docker has been weird since 19.x came out. Especially in the Debian/Ubuntu repos. I’m on docker 20.10.5 and it wouldn’t surprise me in the least if it turned out to be some weird docker thing…

pembo · March 26, 2021, 1:41pm

Funnily enough on my docker host, the upgraded OS packages were:

The following packages will be upgraded:
  containerd.io docker-ce docker-ce-cli
3 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

pembo · March 27, 2021, 4:30pm

Updated docker server, updated container OS… issues still occurring

Some checks:

Container can still see the network - I pinged google.com from inside the HA Core container, and ran some shell scripts which provide sensors to HA
Plenty of CPU, RAM and swap capacity
Container still shows as healthy.
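Since ping only exercises ICMP, a check closer to what an integration does is to open a TCP connection to the device it talks to. This is a minimal sketch using only the standard library; the host and port in the usage line are placeholders, not values from this thread.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder target: substitute a device and port one of your integrations uses.
    print(can_connect("127.0.0.1", 8123))
```

Running this inside the container while HA is reporting a dropout would show whether new connections fail at the OS level or only inside HA.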

I killed HA in the container by accident so couldn’t dig further. For now, I’ll cron a daily restart of the container and check again in 2021.4.x
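For the daily restart, a small script called from cron would do. This is a sketch only: the container name `homeassistant` is an assumption, so use whatever name `docker ps` shows for yours.

```python
import subprocess

CONTAINER_NAME = "homeassistant"  # assumed name; check `docker ps` for the real one


def restart_command(container):
    """Build the docker CLI command that restarts a container."""
    return ["docker", "restart", container]


def restart_container(container=CONTAINER_NAME):
    """Run `docker restart`; return True on success, False on any failure."""
    try:
        result = subprocess.run(restart_command(container), capture_output=True, text=True)
    except OSError:
        # The docker binary is missing or not executable on this host.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("restarted" if restart_container() else "restart failed")
```

A crontab entry with the schedule `0 4 * * *` pointing at the script would run it at 04:00 every day.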

I can’t believe it’s only me and @code-in-progress with the issues…

The network stats won’t show anything as it’s using host networking, btw…

code-in-progress · March 27, 2021, 5:02pm

Yeah, I have a suspicion that the host>docker network bridge driver is at issue. I’ve started sniffing the traffic going from my IoT VLAN to the docker host, hoping I can see some sort of irregularities. Honestly, I kind of wonder if it might be UDP flooding that’s occurring, as I have a metric crap-ton of UDP-only devices (well over 80 of them currently) and I remember that being an issue with docker a while ago. I can’t find the issue that documented the problem, though.
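A cheap way to look for UDP trouble without a packet capture is to read the kernel’s UDP counters from `/proc/net/snmp` and see whether `InErrors` or `RcvbufErrors` grow during a dropout. The parser below is a rough sketch that assumes the usual Linux layout of a header line and a value line, both prefixed with `Udp:`; the function name is made up.

```python
def parse_udp_counters(snmp_text):
    """Parse the 'Udp:' header and value lines of /proc/net/snmp into a dict."""
    rows = [line.split() for line in snmp_text.splitlines() if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("expected a 'Udp:' header line and a 'Udp:' value line")
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, (int(value) for value in values)))


if __name__ == "__main__":
    with open("/proc/net/snmp") as snmp:
        counters = parse_udp_counters(snmp.read())
    for name in ("InDatagrams", "InErrors", "RcvbufErrors"):
        print(name, counters.get(name))
```

Taking two readings a few minutes apart and comparing them gives the rate of drops, which is more telling than the absolute numbers.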

pembo · March 27, 2021, 7:16pm

On the opposite side, it’s only the HA-Core container that is having the issues… a lot of my device integrations go through node-red, which has heavy traffic and isn’t having the same probs, although that image/container hasn’t been updated recently.