i made a fresh installation last week and migrate from Raspberry to a x86 based image. The load times are really high, so invest some time to find out more informations… I searched also here in the forum, on google but nothing helped. I disabled addons, disabled integrations and so on. The cpu load is still high.
My hassio_dns container has started going bonkers as of this morning, as well. It is using 100% of one core and also straining pfsense on another server. In glances, the container is showing a ton of tx/rx (5-6 Mb/s).
Anyone have any idea what is going on here? I’m running HAOS on proxmox.
Ahhhh that’s how you get the container logs. It’s full of connection refused errors on port 853 on the cloudflare servers, followed by some NS errors on a high port on 127.0.0.1.
I assume you have port 853 and/or cloudflare blocked on your network? That does seem to cause runaway DNS queries. Although I’m still not totally sure why. I stopped the runaway healthchecking with this pr but something else seems to kick in when it gets blocked…
I was rerouted all requests back to pfsense at one point, but I thought I removed that NAT rule. I’ll take another look when I get home.
edit: I was pointing home assistant to my pfsense instance for DNS, but I only had pfsense listening on port 53. I did not have the DNS over SSL/TLS option enabled, therefore it was not responding to requests on port 853. Maybe that was it? What’s strange was that it just started this morning unprovoked.
I think I got similar DNS setup as you Andy. I redirect all DNS queries from my LAN to the local DNS (:53) resolver of my router and I’m also not responding on :853
[core-ssh ~]$ ha dns info
fallback: true
host: 172.30.32.3
llmnr: true
locals:
- dns://X.X.X.1
mdns: true
servers: []
update_available: false
version: 2022.04.1
version_latest: 2022.04.1
I’ve been having this issue too, quite frustrating as it bumped the power consumption of the host of my Hassio VM by about 20W from the excessive CPU consumption.
To resolve it for now I’ve ran the command to disable fallback, hope this helps anyone else.
The runaway DNS is a result of something screwy going on when CoreDNS is looking for the NXDOMAIN rcode in the responding packet, but all it gets is a RST packet. The result is infinitely forwarding requests to the fallback. Remove NXDOMAIN from the list of fallback rcodes and everything is stable again.
I’m still digging through the code to see what I can find, but that’s the issue. Also, fallback is deprecated and goes along with proxy which is no longer used. Instead, we should compile CoreDNS to use alternate and forward (already used). Nonetheless, the same issue exists so unfortunately not an easy fix like using the latest package.
Interesting. So the reason nxdomain is in that list is because of this. So I don’t believe we can just remove it. That being said this is helpful for getting it fixed.
What I might do is try to reproduce it with alternate on coredns. Assuming I can I’ll submit a bug there. If they’re able to fix it then we can use it too. If not then when I get a chance I’ll look into it more.
I actually didn’t know about alternate though. I had thought the HA team wrote the code for fallback themselves, did not know it was a modified version of an existing external plugin. Good to know!
EDIT: btw, not trying to discourage you at all. If you can figure out how to fix it please submit a PR! Although the pr can’t just remove nxdomain from the list for the reason I linked above.
Hi @CentralCommand, I’ve been able to root cause the runaway DNS queries as a result of what I call infinite looping configuration in /etc/corefile. The lines of configuration that are contributing to this issue are loop, fallback REFUSED,NXDOMAIN, and max_fails 0. Here’s what happens:
CoreDNS starts and executes the loop plugin, which sends a query for <random number>.<random number>.zone that gets forwarded externally and the Root Servers respond with NXDOMAIN.
CoreDNS triggers the fallback plugin due to the NXDOMAIN response.
CoreDNS now sends all queries, including health_check (NS IN .), to Cloudflare over TLS.
User’s firewall blocks DNS over TLS (TCP 853).
CoreDNS triggers the fallback plugin due to the REFUSED response.
Since max_fails 0 is set, CoreDNS assumes Cloudflare is always healthy.
CoreDNS is now in an infinite loop continuously sending and retrying its health_check query.
Initial thoughts for a PR to fix this are:
HINFO query from the loop plugin should not trigger fallback.
Allow users to specify whether or not to use DNS or TLS when configuring the plugin-dns container.
Don’t assume Cloudflare will always be available.
[Unrelated]: Remove policy sequential so not to overload a single user’s DNS server.
Although, before reworking this configuration, is the narrative on why plugin-dns exists in the first place is to provide continuous access to well-functioning and available DNS servers?
Holy cow - I can’t believe I finally found the issue of high CPU usage. I was going crazy - 4x cores in a Xeon system running at 45-50% ALL THE TIME. Back to 1-3% finally.
Why is this even still an issue in the latest release, my power usage dropped over 50 watts instantly just due to this nonsense.