High CPU load hassio_dns Container

Go to the Supervisor / System page and check your DNS settings in the Host box.

1.1.1.1

But you made my day: I also checked IPv6 and deactivated it, and the CPU usage went down to 0.1! Thank you!!!

My hassio_dns container started going bonkers this morning as well. It is using 100% of one core and is also straining pfSense on another server. In Glances, the container is showing a ton of tx/rx traffic (5-6 Mb/s).

Anyone have any idea what is going on here? I’m running HAOS on proxmox.

I'm seeing the same thing.

My CPU usage increased a lot this morning around 10 AM. Maybe it's related to this release? Release 2022.05.0 · home-assistant/supervisor · GitHub

I don't know if the Supervisor is related to hassio_dns, though.


EDIT: I disabled the fallback DNS method and everything is OK now.
(Requirements: SSH add-on version 9.4.0 or later / Supervisor version 2022.05.0 or later)
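
For anyone looking for the exact command: as posted further down in this thread, the fallback can be switched off from the SSH add-on with

ha dns options --fallback=false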

That did it! Jesus Christ, that was nuts. Any idea what was in all of that traffic? Was it just millions of DNS requests? I'm so curious.

What do your logs say? You should be able to see the logs for the DNS container by going here and changing the logger to DNS.
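
If you prefer the command line, the same logs can also be pulled from the SSH add-on (as shown further down in the thread) with

ha dns logs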

Ahhhh, that's how you get the container logs. It's full of connection refused errors on port 853 to the Cloudflare servers, followed by some NS errors on a high port on 127.0.0.1.

I assume you have port 853 and/or Cloudflare blocked on your network? That does seem to cause runaway DNS queries, although I'm still not totally sure why. I stopped the runaway health checking with this PR, but something else seems to kick in when it gets blocked…

I was rerouting all requests back to pfSense at one point, but I thought I had removed that NAT rule. I'll take another look when I get home.

edit: I was pointing Home Assistant to my pfSense instance for DNS, but I only had pfSense listening on port 53. I did not have the DNS over SSL/TLS option enabled, so it was not responding to requests on port 853. Maybe that was it? What's strange is that it just started this morning, unprovoked.
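
If you want to quickly check whether the DNS-over-TLS path is blocked from your network, one option (assuming openssl is available on a machine in the same network segment; this is just a generic connectivity test, not something from the add-on itself) is to try opening a TLS connection to Cloudflare on port 853:

openssl s_client -connect 1.1.1.1:853 -servername cloudflare-dns.com </dev/null

If that is refused or hangs, it matches the connection refused errors on port 853 showing up in the DNS container logs.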

I think I've got a similar DNS setup to yours, Andy. I redirect all DNS queries from my LAN to the local DNS resolver (:53) on my router, and I'm also not responding on :853.

[core-ssh ~]$ ha dns info
fallback: true
host: 172.30.32.3
llmnr: true
locals:
- dns://X.X.X.1
mdns: true
servers: []
update_available: false
version: 2022.04.1
version_latest: 2022.04.1

ha dns logs

[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:35512 - 55954 "NS IN . udp 17 false 512" NOERROR - 0 5.000312011s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:41948 - 41961 "NS IN . udp 17 false 512" NOERROR - 0 5.000560433s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:37715 - 45325 "NS IN . udp 17 false 512" NOERROR - 0 5.000469139s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: connect: connection refused
[INFO] 127.0.0.1:36386 - 22578 "NS IN . udp 17 false 512" NOERROR - 0 5.001260033s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:33879 - 39043 "NS IN . udp 17 false 512" NOERROR - 0 5.003120018s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:54263 - 10403 "NS IN . udp 17 false 512" NOERROR - 0 5.003841209s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused

(screenshot: CPU Usage graph)

I've been having this issue too. It's quite frustrating, as the excessive CPU consumption bumped the power draw of the host running my Hassio VM by about 20 W.

To resolve it for now I've run the command to disable fallback. Hope this helps anyone else.

The runaway DNS is a result of something screwy going on when CoreDNS is looking for the NXDOMAIN rcode in the responding packet but all it gets is an RST packet. The result is that requests are forwarded to the fallback indefinitely. Remove NXDOMAIN from the list of fallback rcodes and everything is stable again.

I'm still digging through the code to see what I can find, but that's the issue. Also, fallback is deprecated and goes along with proxy, which is no longer used. Instead, we should compile CoreDNS to use alternate and forward (which is already used). Nonetheless, the same issue exists there, so unfortunately it's not an easy fix like just using the latest package.

Interesting. The reason NXDOMAIN is in that list is because of this, so I don't believe we can just remove it. That being said, this is helpful for getting it fixed.

What I might do is try to reproduce it with alternate in CoreDNS. Assuming I can, I'll submit a bug there. If they're able to fix it, then we can use it too. If not, I'll look into it more when I get a chance.

I actually didn't know about alternate, though. I had thought the HA team wrote the code for fallback themselves; I didn't know it was a modified version of an existing external plugin. Good to know!

EDIT: By the way, I'm not trying to discourage you at all. If you can figure out how to fix it, please submit a PR! Although the PR can't just remove NXDOMAIN from the list, for the reason I linked above.

Hi @CentralCommand, I've been able to root-cause the runaway DNS queries to what I call an infinite-looping configuration in /etc/corefile (see the illustrative sketch after the list below). The lines of configuration contributing to this issue are loop, fallback REFUSED,NXDOMAIN, and max_fails 0. Here's what happens:

  1. CoreDNS starts and executes the loop plugin, which sends a query for <random number>.<random number>.zone; that query gets forwarded externally, and the root servers respond with NXDOMAIN.

  2. CoreDNS triggers the fallback plugin due to the NXDOMAIN response.

  3. CoreDNS now sends all queries, including health_check (NS IN .), to Cloudflare over TLS.

  4. User’s firewall blocks DNS over TLS (TCP 853).

  5. CoreDNS triggers the fallback plugin due to the REFUSED response.

  6. Since max_fails 0 is set, CoreDNS assumes Cloudflare is always healthy.

  7. CoreDNS is now in an infinite loop, continuously sending and retrying its health_check query. 🤢
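
To make the failure mode easier to picture, here is a rough sketch of the kind of corefile layout being described. It is only an illustration assembled from the directives named in this post (loop, policy sequential, fallback REFUSED,NXDOMAIN, max_fails 0, and the TLS forward to Cloudflare seen in the logs), not the actual file shipped in plugin-dns; the 192.168.1.1 local server and the 127.0.0.1:5553 internal fallback block are placeholders.

.:53 {
    loop
    # primary path: the user's own DNS server (placeholder address)
    forward . dns://192.168.1.1 {
        policy sequential
    }
    # on REFUSED or NXDOMAIN, re-resolve via the fallback block below
    fallback REFUSED,NXDOMAIN . dns://127.0.0.1:5553
}

.:5553 {
    # fallback path: Cloudflare over TLS; with max_fails 0 it is never marked unhealthy
    forward . tls://1.1.1.1 tls://1.0.0.1 {
        tls_servername cloudflare-dns.com
        max_fails 0
    }
}

With a layout like this, an NXDOMAIN on the primary path (including the one triggered by loop's own detection query) diverts queries to the TLS block; if port 853 is blocked, that block only ever produces refused connections, and because it can never be marked unhealthy, the health_check query just keeps being retried.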


Initial thoughts for a PR to fix this are:

  1. The HINFO query from the loop plugin should not trigger fallback.
  2. Allow users to specify whether to use plain DNS or DNS over TLS when configuring the plugin-dns container.
  3. Don't assume Cloudflare will always be available.
  4. [Unrelated]: Remove policy sequential so as not to overload a single user's DNS server.

Although, before reworking this configuration: is the rationale for plugin-dns existing in the first place to provide continuous access to well-functioning, available DNS servers?

@d0nni3q84 is there also a fix for that issue?

Seems this is the fix: Very high CPU usage for CoreDNS - #2 by Ohjay

Holy cow - I can't believe I finally found the cause of the high CPU usage. I was going crazy - 4 cores in a Xeon system running at 45-50% ALL THE TIME. Finally back to 1-3%.

Why is this even still an issue in the latest release? My power usage instantly dropped by over 50 watts just from this nonsense.

This is a fix for me too.

I have pfSense blocking DNS (53/853) except to itself.

Not sure why HA would go to Cloudflare for DNS when it has the correct server (the gateway) assigned via DHCP.

Hope this gets fixed in the future.

Yeah…

[INFO] 127.0.0.1:42268 - 12241 "NS IN . udp 17 false 512" NOERROR - 0 5.000172555s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: connect: connection refused
[INFO] 127.0.0.1:54269 - 45423 "NS IN . udp 17 false 512" NOERROR - 0 5.000260564s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:37725 - 1294 "NS IN . udp 17 false 512" NOERROR - 0 5.000177642s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:52744 - 15367 "NS IN . udp 17 false 512" NOERROR - 0 5.000148847s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:37220 - 35442 "NS IN . udp 17 false 512" NOERROR - 0 5.00029164s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: connect: connection refused
[INFO] 127.0.0.1:60441 - 23294 "NS IN . udp 17 false 512" NOERROR - 0 5.000075793s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 172.30.32.1:43194 - 56666 "A IN lgwebostv.local.hass.io. udp 41 false 512" NXDOMAIN qr,aa,rd 41 0.000347154s
[INFO] 127.0.0.1:36765 - 771 "NS IN . udp 17 false 512" NOERROR - 0 5.000145671s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:38426 - 30149 "NS IN . udp 17 false 512" NOERROR - 0 5.000116136s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused
[INFO] 127.0.0.1:46800 - 38111 "NS IN . udp 17 false 512" NOERROR - 0 5.000157747s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: connect: connection refused
[INFO] 127.0.0.1:43374 - 56561 "NS IN . udp 17 false 512" NOERROR - 0 5.000402746s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: connect: connection refused

I used the command ha dns options --fallback=false in the terminal and now it's solved -.-
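
To confirm the change took effect, ha dns info should now report fallback: false.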

Same issue here, current version of HA on a Raspberry Pi 4.

This took a long time to track down. Thanks for the fix; disabling the fallback resolved it for me as well.

I can't say I'm in real trouble, but I do see those higher spikes now and then.

I've just checked, and indeed fallback: true is returned.

Would it be safe to disable that point blank? Or, put differently, would it be helpful at all? To be honest, I can't really find anything wrong in the DNS logs; they only show lines like

[INFO] 172.30.32.1:[port] - [1234] "A IN wlp3s0.local.hass.io. udp 38 false 512" NXDOMAIN qr,aa,rd 38 0.000170954s
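
For reference, the toggle used earlier in the thread is reversible: ha dns options --fallback=false turns the fallback off, and ha dns options --fallback=true should turn it back on if you want to restore the default later.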