Local DNS!

see my solution (using AdGuard)

So, this is still broken, and apparently won’t get fixed. The reasoning I’ve heard is that it’s this way so that people who don’t do system administration can… checks notes… use a system that involves configuring and automating several other systems. Huh.

Well anyway, if there are any people out there willing to do some network administration, you can probably add a rule to your firewall that redirects all port 53 (DNS) traffic to your chosen local DNS server and call it a day.
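
For what it’s worth, on a Linux-based router that kind of redirect might look roughly like the following iptables rules. This is a sketch only: eth1 and the 192.168.1.2 DNS server address are placeholders for your own LAN interface and server.

# Redirect LAN DNS traffic (UDP/TCP 53) that isn't already headed for the
# local DNS server (192.168.1.2 here) to that server instead
iptables -t nat -A PREROUTING -i eth1 ! -d 192.168.1.2 -p udp --dport 53 -j DNAT --to-destination 192.168.1.2:53
iptables -t nat -A PREROUTING -i eth1 ! -d 192.168.1.2 -p tcp --dport 53 -j DNAT --to-destination 192.168.1.2:53

If the DNS server sits on the same LAN segment as the clients, you may also need a masquerade (hairpin NAT) rule so the redirected replies flow back through the router.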

Edit: Even this doesn’t work. I have configured and tested NAT rules that intercept plain DNS as well as DNS over HTTPS/TLS (ports 443 and 853). I have hardcoded the name into the hosts file. Nothing will let me use a local server for DNS resolution. HA insists on using 1.0.0.1:853 or 1.1.1.1:853 regardless, and won’t resolve local hostnames.


Wow, glad to hear I’m not the only one with this ridiculous issue. I’ve tried both suggested edits to the corefile (followed by ha dns restart) but unfortunately neither worked. The dumb way things are handled at the container level will be the death of me. I can get short-name resolution at the host level (and throughout my entire network) but not in the container, because it appends a DNS suffix of “.local.hass.io” rather than my EdgeRouter 4’s “.router.home” domain and doesn’t have the decency to ask my router. Specifying the FQDN works at both levels with nslookup/ping, but I’m not going to litter my config with FQDNs when short names should just work.

The only way I can think of to solve this, based on what everyone’s tried so far, is to set up a WINS server in Docker to resolve short names in another domain - but even if that’s possible and happens to work, it’s beyond stupid for a multitude of reasons. Hopefully the devs can wake up and fix this instead of defending such a mediocre implementation.
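
If anyone wants to reproduce the difference between host-level and container-level resolution, something like this shows it (a sketch: printer is a placeholder hostname, .router.home is my local domain, and it assumes nslookup is available inside the DNS plugin container):

# On the HAOS host: the short name resolves via the router
nslookup printer

# Inside the DNS plugin container: the short name fails, while the FQDN works
docker exec hassio_dns nslookup printer
docker exec hassio_dns nslookup printer.router.home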

EDIT: I changed the line “fallback REFUSED,SERVFAIL,NXDOMAIN . dns://127.0.0.1:5553”, swapping “127.0.0.1:5553” for my router’s IP on port 53 rather than 5553, and that seemed to give me full short-name resolution across the board - but of course it doesn’t survive a host reboot. If I could script it, that would suffice as a dirty workaround. - Nvm, that doesn’t work either. It resolves the name but still appends the .local.hass.io suffix. Ugh.
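
For anyone who still wants to script that kind of edit, a rough sketch run on the host could look like the following. Treat it purely as an illustration: 192.168.1.1 is a placeholder for the router, and the template path inside the hassio_dns container is an assumption that may differ between versions.

#!/bin/sh
# Dirty workaround sketch: point the fallback at the local router instead of the
# built-in DoT proxy, then restart the DNS plugin so the corefile is regenerated.
# The template path below is assumed, not confirmed.
docker exec hassio_dns sed -i 's|dns://127.0.0.1:5553|dns://192.168.1.1:53|g' /usr/share/tempio/corefile.template
ha dns restart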

This is absolutely the most frustrating thing broken in HASS for me.


Same on my side… this bug is driving me crazy.

Hey all, I’m Mike Degatano. Anyone who watched the 2022.3 release party saw me on the stream - I was announced as Nabu Casa’s new hire focusing on the Supervisor. I have been working on this exact issue and was wondering if anyone facing it would be willing to try the beta channel. I have put in two changes that should help, and they are available there right now:

  1. mDNS to systemd-resolved - the DNS plugin no longer attempts to resolve mDNS and LLMNR names itself. Instead it simply asks the host’s native systemd-resolved service to do that for us. This should ensure .local names work properly: as long as the host can resolve the name, we can too.
  2. Cloudflare as a fallback only, with no healthcheck - I changed the corefile to remove Cloudflare from the list of forwarding servers; it is now used only as a fallback. This should keep the plugin from getting “stuck” on Cloudflare the way many of you have been seeing, where it moves on from your listed local DNS servers after hitting issues and decides that only the Cloudflare one works. It will now always try your local DNS servers first. It also no longer healthchecks Cloudflare at all, which prevents the runaway healthchecks some of you reported in issues when Cloudflare is blocked. A rough sketch of the resulting corefile follows this list.
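
To make that concrete, the change amounts to a corefile shaped roughly like this. This is a sketch, not the exact shipped template; 192.168.1.1 stands in for whatever local DNS servers are configured, and 127.0.0.1:5553 is the plugin’s local proxy for Cloudflare over DNS-over-TLS.

.:53 {
    # Only the configured local DNS servers are forwarded to directly
    forward . dns://192.168.1.1

    # Cloudflare (via the local proxy on 127.0.0.1:5553) is used as a fallback only
    fallback REFUSED . dns://127.0.0.1:5553
    fallback SERVFAIL . dns://127.0.0.1:5553
    fallback NXDOMAIN . dns://127.0.0.1:5553
}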

I’m not done here - there are a few more changes to put in. But I thought those two might help you all, from what I’ve been reading. So if any of you would be willing to give the beta a try and let me know whether your experience is better or there are still issues to address, I would really appreciate it.


How do we switch to the beta for hassio-dns? Do we have to switch the supervisor to beta?

Yes, that puts the whole install on the beta channel, which means it will install the current Supervisor beta. Sorry, there’s no way to subscribe only the one plugin to beta.
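
For reference, switching channels can be done from the CLI - assuming I have the flags right, something like:

# Move the whole install to the beta channel and pull the beta Supervisor
ha supervisor options --channel beta
ha supervisor update

# Switch back to stable later the same way
ha supervisor options --channel stable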

I think (pretty sure) you can force just the DNS plugin beta with:

ha dns update --version 2022.04.0

(The only one that won’t work for is the Supervisor itself, if you are not on the beta channel.)

Oh yeah, good point. I don’t think that is sticky though - I believe the Supervisor will eventually realize that the wrong version of the DNS plugin is installed for the channel it thinks it’s on and correct it. But it should last long enough for some testing.

I pulled the beta DNS plugin when it was released and it’s still sticking around now.

Yeah, I’ve done the same. Seems to be sticky for me too.

@CentralCommand I’ve been using your changes from plugin-dns PR #82, manually applied to the coredns template file. It’s been working much, much better. Over the last week I’ve only had one “incident” of DNS giving up the ghost and my external-URL-based camera entities showing up as broken images on the Lovelace dashboard. I have to restart HA to get them back; I’m not sure why a temporary DNS name failure gets stuck and isn’t retried at some point. I’ll give switching over to the beta a go if that still makes sense; otherwise, if a release to production is coming, I’ll hold until then. Thanks for the work on this.

It should come out of beta soon, most likely next week barring reports of issues.

I’m a bit disappointed it still got stuck. From my understanding of the forward plugin in CoreDNS I don’t see how that could happen - only your DNS servers are listed there now, and Cloudflare should only be tried if those fail. Was there anything of note in the log for the DNS plugin? You can get to that from here for reference. You can also turn on debug logging for the Supervisor by doing ha supervisor options --logging debug. That turns on debug logging for the plugins as well, which might give more insight.
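
If the CLI is easier than the UI link, the same information should be available there too (assuming I have the subcommands right):

# Turn on debug logging for the Supervisor and its plugins
ha supervisor options --logging debug

# Show the DNS plugin's log
ha dns logs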

One possibility I would like to check, if you don’t mind (it should be quick). Most of the containers used in a typical Home Assistant deployment are Alpine-based, which means they use musl rather than glibc. This has an interesting consequence for DNS because of this commit. According to the DNS spec, if a host exists, a DNS server should always return NOERROR - even if it has no answers for the particular query type (for example, if a host has only an IPv4 address but not an IPv6 one). musl enforces this; glibc does not. Because musl and Alpine are newer, not all DNS servers respect this rule, and as a result unexpected NXDOMAINs can be encountered on Alpine-based systems.

We set up a simple test to check for this, and once this PR is merged we are going to start testing DNS servers for it so we can let users know if their DNS server has an issue. In the meantime you can test manually using the test domain we set up, which only resolves for A-type queries:

dig _checkdns.home-assistant.io A +noall +comments
dig _checkdns.home-assistant.io AAAA +noall +comments

Your response should have status: NOERROR for both of those, like this:

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6899

If not, then that may be the issue: a query for a type not supported by one of your local domains could be getting back an NXDOMAIN, causing the DNS plugin (which is Alpine-based) to think the entire domain doesn’t exist - and Home Assistant too, since it is also Alpine-based.
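
If you’d rather not eyeball the dig output, a small script along these lines checks both record types (it just greps the comment section for the status):

#!/bin/sh
# Both A and AAAA queries for the test domain should report NOERROR
for type in A AAAA; do
    status=$(dig _checkdns.home-assistant.io "$type" +noall +comments | grep -o 'status: [A-Z]*')
    echo "$type -> $status"
done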

@CentralCommand Very happy to see this starting to get sorted after 2 years of my requests getting constantly shot down.

The only remaining issue, as far as I can see, is the ability to disable the fallback completely. I understand that this may well be unpopular, as it sort of defeats the purpose of the fallback as a final catch-all. That being the case, the behaviour could instead be modified to only call the fallback after a SERVFAIL response. Currently we have:

fallback REFUSED . dns://127.0.0.1:5553
fallback SERVFAIL . dns://127.0.0.1:5553
fallback NXDOMAIN . dns://127.0.0.1:5553

However, REFUSED and NXDOMAIN are not errors, and the fallback should not be used when these messages are received.

From my own observations, I believe the fallback is also used when a NOERROR with an empty (NULL) answer is received - again, this is not an error, and the fallback should not be invoked.

Here is an example where the fallback is called even though NOERROR is returned (the fallback itself fails because I have it redirected to a local service, which rejects it due to the certificate mismatch):

[INFO] 127.0.0.1:48415 - 56587 "AAAA IN api.viessmann.com. udp 46 true 2048" NOERROR - 0 1.028544507s
[ERROR] plugin/errors: 2 api.viessmann.com. AAAA: dial tcp 1.0.0.1:853: i/o timeout

Again, very happy to see the changes you have made so far. What do you think about using the fallback only on the SERVFAIL condition?
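
In other words, the corefile would keep just a single fallback directive - something like this, with the same local proxy target as above:

fallback SERVFAIL . dns://127.0.0.1:5553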


So I actually asked specifically about those when I joined. The fallback is used on NXDOMAIN and REFUSED because of this:

It seems there was a rash of bugs at one point whose root cause was that self-hosted DNS server software had a habit of returning NXDOMAIN or REFUSED for AAAA requests when a domain only resolves for A requests. This was extremely problematic because github.com and ghcr.io don’t resolve on AAAA requests - they only have IPv4 addresses. So DNS servers that did this basically broke all of Home Assistant: every Alpine-based container (i.e. nearly every image we ship) thought github.com and ghcr.io didn’t exist anymore and was completely unable to check for updates or pull most of our images (among other things, but this was the most obvious).
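
You can see the expected shape of this against a well-behaved resolver - the AAAA query should still come back NOERROR, just with an empty answer section:

# A records exist, AAAA records do not, but both should report status: NOERROR
dig github.com A +noall +comments +answer
dig github.com AAAA +noall +comments +answer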

This was common enough that it was actually the reason the fallback was added in the first place. SERVFAIL is kind of expected once in a while with a local DNS server; people generally get what happened when they see it in the log. But random NXDOMAIN responses for sites they know exist and did not block in their software? That is confusing.

That being said, the plan actually is to add the ability to disable the fallback; it’s the next step after this PR. Once the Supervisor is able to detect this confusing situation and inform users about it, we’re comfortable giving them a way to disable the fallback. We just want to make sure users whose DNS server is going to create problems are aware of it first, so we don’t get a rash of bugs again.


Great explanation, thanks!

One thing I would like to understand is why the platform generates AAAA requests when IPv6 is disabled, as in my case (or is IPv6 enabled between the containers?).

You can see in my last post that an AAAA request is sent to my local DNS server (why?), which responds with NOERROR since the domain does exist with an A record - but the system also sent an A request for the same domain at the same time… why ask for the AAAA record when IPv6 is unavailable at the network level?

My HA DNS logs are also full of these IPv6 errors:

[INFO] 127.0.0.1:50872 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 60 true 2048" NOERROR - 0 0.00885769s
[ERROR] plugin/errors: 2 [::ffff:c0a8:ad3].local.hass.io. A: plugin/forward: no next plugin found
[INFO] 172.30.32.1:33177 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 49 false 512" SERVFAIL qr,rd 49 0.02655286s
[INFO] 172.30.32.1:33177 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 49 false 512" SERVFAIL qr,aa,rd 49 0.000467031s
[INFO] 172.30.32.1:33177 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 49 false 512" SERVFAIL qr,aa,rd 49 0.00052328s
[INFO] 172.30.32.1:33177 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 49 false 512" SERVFAIL qr,aa,rd 49 0.000452082s
[INFO] 172.30.32.1:33177 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 49 false 512" SERVFAIL qr,aa,rd 49 0.001225571s
[INFO] 172.30.32.1:33177 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 49 false 512" SERVFAIL qr,aa,rd 49 0.000536457s

Which I also don’t understand - it’s asking for the A record of an IPv6 address…

Tbh, I’m not sure. I don’t believe we are doing anything specific to force IPv6. I see it as disabled when I inspect the Docker networks on my HAOS systems, and I also have it disabled in the NetworkManager config. Despite that, I notice that when I SSH into the host and ask systemd-resolved to query homeassistant.local, I still see a bunch of link-local IPv6 answers in addition to the IPv4 ones I would expect:

# resolvectl query homeassistant.local
homeassistant.local: 192.168.1.5               -- link: eth0
                     172.30.32.1               -- link: hassio
                     172.17.0.1                -- link: docker0
                     fe80::42:78ff:fe5c:1acb%4 -- link: hassio
                     fe80::42:a5ff:fe35:86d9%5 -- link: docker0
                     fe80::e46d:53ff:fe87:eeda%7 -- link: veth11e2789
                     fe80::2852:27ff:fe10:c27a%9 -- link: vethae5cb0b
                     fe80::70a2:81ff:feb6:b324%11 -- link: vethbf2156e

Our stack is pretty complicated, and we rely on a lot of underlying technology for basic functionality like DNS. I can try to track it down, but it might be tough to figure out exactly what is firing off AAAA queries and why.
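
For anyone who wants to check the same thing on their own system, Docker reports per network whether IPv6 is enabled (the network names below match a HAOS install; adjust as needed):

# Show whether IPv6 is enabled on the Home Assistant Docker networks
docker network inspect -f '{{.Name}}: IPv6={{.EnableIPv6}}' hassio
docker network inspect -f '{{.Name}}: IPv6={{.EnableIPv6}}' bridge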

The other thing worth noting is that a lot of people with IPv6 disabled are likely going to change their position on that in the (hopefully) near future. I’m not sure if you’re aware, but Matter and Thread actually depend on IPv6. I have a lot of catching up to do in this space, but the folks working on making sure HA is ready for Matter and Thread have mentioned this a few times. I’ve been meaning to enable IPv6 on my network to get ahead of this and any potential issues, but haven’t gotten around to it yet.

That’s pretty weird. Not really sure what to make of it. I’ll put it on my list to try to figure out where that is coming from - it doesn’t really make sense as a query.


I’ve seen a little on Thread & Matter, and just assumed HA would use NAT64 if IPv6 was unavailable on the host side of the network. My ISP doesn’t support IPv6 and neither does my router :frowning:

Maybe. Like I said, I also have a lot of learning to do in this space. Sorry, I didn’t mean to get off topic; I really just meant to point out that it’s going to become more important soon.
