Local DNS!

I pulled the beta DNS when it was released and it’s still getting stuck even now.

Yeah, I’ve done the same. It still seems to get stuck for me too.

@CentralCommand I’ve been using your changes from plugin-dns PR #82, manually applied to the coredns template file, and it’s been working much, much better. Over the last week I’ve only had one “incident” of DNS giving up the ghost and my external-URL-based camera entities showing up as broken images on the Lovelace dashboard. I have to restart HA to get them back; I’m not sure why a temporary DNS name failure gets stuck and is never retried. I’ll switch over to the beta if that still makes sense; otherwise, if a release of this to production is coming, I’ll hold until then. Thanks for the work on this.

It should come out of beta soon, most likely next week barring reports of issues.

I’m a bit disappointed it still got stuck. From my understanding of the forward plugin in coredns I don’t see how that could happen: only your DNS servers are listed there now, and Cloudflare should only be tried if those fail. Was there anything of note in the log for the DNS plugin? You can get to that from here for reference. You can also turn on debug logging for supervisor by doing ha supervisor options --logging debug, which turns on debug logging for the plugins as well and might give more insight.
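For reference, both of those are quick to do from an SSH session; this is just standard CLI usage (ha dns logs prints the DNS plugin’s log):

# enable debug logging for supervisor, which also enables it for the plugins
ha supervisor options --logging debug

# then pull the DNS plugin log and look for errors around the incident
ha dns logs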

One possibility I would like to check, if you don’t mind (it should be quick). Most of the containers used in a typical Home Assistant deployment are Alpine-based, which means they use musl rather than glibc. This has an interesting consequence when it comes to DNS because of this commit. According to the DNS spec, if a host exists a DNS server should always return NOERROR, even if it has no answers for the particular type of query (for example, if a host has only an ipv4 address but not an ipv6 one). musl enforces this; glibc does not. Because musl and Alpine are newer, not all DNS servers respect this rule, and as a result unexpected NXDOMAINs can be encountered on Alpine-based systems.

We set up a simple test to check for this and are going to start testing DNS servers for it, so we can let users know if their DNS server has an issue once this PR is merged. In the meantime you can test this manually using the test domain we set up, which only resolves for A-type queries:

dig _checkdns.home-assistant.io A +noall +comments
dig _checkdns.home-assistant.io AAAA +noall +comments

Your response should have status: NOERROR for both of those, like this:

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6899

If not, that may be the issue: a query issued for a type not supported by one of your local domains could be getting back an NXDOMAIN, causing the DNS plugin (which is Alpine-based) to think the entire domain doesn’t exist. The same goes for Home Assistant, since it is also Alpine-based.
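For contrast, a misbehaving server would answer the AAAA query with a header along these lines (illustrative output; the id is arbitrary):

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 23413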

@CentralCommand Very happy to see this starting to get sorted after 2 years of my requests getting constantly shot down

The only issue remaining, as far as I can see, is the ability to disable the fallback completely. I understand that this may well be unpopular, as it sort of defeats the purpose of the fallback as a final catch-all. That being the case, the behaviour could be modified so as to only call the fallback after a SERVFAIL response. Currently we have:

fallback REFUSED . dns://127.0.0.1:5553
fallback SERVFAIL . dns://127.0.0.1:5553
fallback NXDOMAIN . dns://127.0.0.1:5553

However, REFUSED and NXDOMAIN are not errors, and the fallback should not be used when these messages are received.

I believe the fallback is also used (from my own observations) when a NOERROR with a null (empty) response is received; again, this is not an error, and the fallback should not be invoked.

Here is an example where the fallback is called even though a NOERROR is returned (the fallback fails because I have it redirected to a local service, which fails because of the cert mismatch):

[INFO] 127.0.0.1:48415 - 56587 "AAAA IN api.viessmann.com. udp 46 true 2048" NOERROR - 0 1.028544507s
[ERROR] plugin/errors: 2 api.viessmann.com. AAAA: dial tcp 1.0.0.1:853: i/o timeout

Again, very happy to see the changes you have made so far. What do you think about just using the fallback on the SERVFAIL condition?
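For illustration, that would mean trimming the fallback section of the coredns template down to just the one line:

fallback SERVFAIL . dns://127.0.0.1:5553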

3 Likes

So I actually specifically asked about those when I joined. The reason the fallback is used on NXDOMAIN and REFUSED is this:

It seems there was a rash of bugs at one point, and the root cause was that self-hosted DNS server software had a habit of returning NXDOMAIN or REFUSED for AAAA requests when a domain only resolves on A requests. This was extremely problematic because github.com and ghcr.io don’t resolve on AAAA requests; they only have an ipv4 address. So DNS servers that did this basically broke all of Home Assistant: every Alpine-based container (i.e. nearly every image we ship) thought github.com and ghcr.io didn’t exist anymore and was completely unable to check for updates or pull most of our images (among other things, but this was the most obvious).
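You can see the A-only behaviour for yourself with the same kind of dig check as earlier in the thread (assuming github.com still has no AAAA record; a compliant server should return NOERROR with zero answers for the AAAA query):

dig github.com A +noall +comments
dig github.com AAAA +noall +comments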

This was common enough that it was actually the reason the fallback was added. SERVFAIL is kind of expected once in a while with a local DNS server; people kind of get what happened when they see that in the log. But random NXDOMAIN responses for sites they know exist and did not block in their software? That is confusing.

That being said, the plan actually is to add the ability to disable the fallback. It’s the next step after this PR. Once supervisor is able to detect this confusing situation and inform users about it, we’re comfortable providing a way to disable the fallback. We just want to make sure users with a DNS server that is going to create problems are aware of it first, so we don’t get a rash of bugs again.

6 Likes

Great explanation, thanks!

One thing I would like to understand is why AAAA requests are generated by the platform when ipv6 is disabled, as in my case (or is ipv6 enabled between containers?).

You can see in my last post that an AAAA request is sent to my local DNS server (why?), which responds with NOERROR since the domain does exist with an A record, but the system also sent an A request at the same time for the same domain… why ask for the AAAA record when ipv6 is unavailable at the network level?

My HA DNS logs are also full of these ipv6 errors:

[INFO] 127.0.0.1:50872 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 60 true 2048" NOERROR - 0 0.00885769s
[ERROR] plugin/errors: 2 [::ffff:c0a8:ad3].local.hass.io. A: plugin/forward: no next plugin found
[INFO] 172.30.32.1:33177 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 49 false 512" SERVFAIL qr,rd 49 0.02655286s
[INFO] 172.30.32.1:33177 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 49 false 512" SERVFAIL qr,aa,rd 49 0.000467031s
[INFO] 172.30.32.1:33177 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 49 false 512" SERVFAIL qr,aa,rd 49 0.00052328s
[INFO] 172.30.32.1:33177 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 49 false 512" SERVFAIL qr,aa,rd 49 0.000452082s
[INFO] 172.30.32.1:33177 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 49 false 512" SERVFAIL qr,aa,rd 49 0.001225571s
[INFO] 172.30.32.1:33177 - 116 "A IN [::ffff:c0a8:ad3].local.hass.io. udp 49 false 512" SERVFAIL qr,aa,rd 49 0.000536457s

Which I also don’t understand; it’s asking for the A record of an ipv6 address…

Tbh, I’m not sure. I don’t believe we are specifically doing anything to force ipv6. I see it as disabled when I inspect the docker networks on my HAOS systems, and I also have it disabled in the nm config. Despite that, though, when I ssh into the host and ask systemd-resolved to query homeassistant.local I still see a bunch of link-local ipv6 answers in addition to the ipv4 ones I would expect:

# resolvectl query homeassistant.local
homeassistant.local: 192.168.1.5               -- link: eth0
                     172.30.32.1               -- link: hassio
                     172.17.0.1                -- link: docker0
                     fe80::42:78ff:fe5c:1acb%4 -- link: hassio
                     fe80::42:a5ff:fe35:86d9%5 -- link: docker0
                     fe80::e46d:53ff:fe87:eeda%7 -- link: veth11e2789
                     fe80::2852:27ff:fe10:c27a%9 -- link: vethae5cb0b
                     fe80::70a2:81ff:feb6:b324%11 -- link: vethbf2156e
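For reference, checking the docker side is a one-liner like this (hassio is the network name on my system, and EnableIPv6 comes back false for me):

docker network inspect hassio --format '{{.EnableIPv6}}'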

Our stack is pretty complicated and we rely on a lot of underlying technology for basic functionality like DNS. So I can try to track it down, but it might be tough to figure out what exactly is firing off AAAA queries and why.

The other thing I think is worth noting is that a lot of people with ipv6 disabled are likely going to be changing their position on that in the (hopefully) near future. I’m not sure if you’re aware, but Matter and Thread actually depend on ipv6. I have a lot of catching up to do in this space, but I know the folks working on ensuring HA is ready for Matter and Thread have mentioned this a few times. I’ve been meaning to enable ipv6 on my network to get ahead of this and any potential issues, but haven’t gotten around to it yet.

That’s pretty weird. Not really sure what to make of that. I’ll put it on my list to try and figure out where that is coming from; it doesn’t really make sense as a query.

1 Like

I’ve seen a little on Thread & Matter; I just assumed HA would use NAT64 if ipv6 was unavailable on the host side of the network. My ISP doesn’t support ipv6 and neither does my router :frowning:

Maybe. Like I said, I also have a lot of learning to do in this space. Sorry, I didn’t mean to get off topic; I really just meant to point out that it’s going to become more important soon.

2 Likes

Thank you so much for taking this over. Really appreciate it and would be happy to support!

5 Likes

Hi Mike, I’ve only just discovered your post via a post by balloob on Reddit. Just to let you know, I have been running the supervisor 2022.4.2 beta pretty much since it came out a week ago. Zero issues here. :+1:

1 Like

FYI all, I’ve added an option to disable the fallback DNS to the beta channel. You will actually need to be on the beta channel to try it out in this case, as it involved updates to multiple components:

  1. Add a new option to enable/disable the fallback DNS to supervisor’s API
  2. Update the CLI to support this new option
  3. Update the DNS plugin to understand this new option

If you want to try it out I would appreciate the feedback. If you would prefer not to switch to the beta channel I understand. Assuming no issues it should make it to stable soon.

Using it is a bit tricky at the moment because the SSH add-ons will need an update to use the new version of the CLI, and that won’t happen until it is available in the stable channel. So what you’ll need to do is SSH into one of the add-ons and then call supervisor’s API directly, like this:

curl -X POST http://supervisor/dns/options \
  -H "Authorization: Bearer $(printenv SUPERVISOR_TOKEN)" \
  -H "Content-Type: application/json" -d '{"fallback": false }'

You can confirm it worked with the following command; fallback should be false in the output:

> ha dns info
fallback: false
host: 172.30.32.3
llmnr: true
...

The info command is able to show this because it simply prints everything in the output, even if it’s something new it has never seen before (like fallback). The options command does require an update to add support for new flags. Once this change makes it to stable and the SSH add-ons are updated, you will simply be able to do this to enable/disable the fallback DNS:

ha dns options --fallback=false

I also want to note that doing this will not necessarily mark your system as unsupported. Your system will only be marked unsupported if we detect an issue with one of the other DNS servers you provided. The two issues we are looking for are:

  1. DNS server unable to resolve an A query on _checkdns.home-assistant.io
  2. DNS server returns a status other than NOERROR for an AAAA query on _checkdns.home-assistant.io. See here for why that is a problem.

If you have one of these issues you will see it when you execute the following command:

ha resolution info

The fallback guaranteed us a DNS server able to correctly resolve the names we need. Without it enabled, you will need to make sure your own servers do, or else you will likely hit unexpected issues. But as long as your provided DNS servers are working correctly, the system is considered supported with or without the fallback DNS.

I do want to note, though, that this is new territory for us. It’s possible we’ll have to add more checks for DNS server requirements in the future if we see new issues come up for people who have disabled the fallback DNS.

8 Likes

Wanted to give everyone a heads up: supervisor is being updated now. 2022.05.0 is being pushed out to stable, which includes the option to disable the fallback DNS. Additionally, both SSH add-ons (the official one and the community one) have been updated with the latest CLI version, which includes support for disabling the fallback. That means you can ignore all the curl stuff above and just type this:

ha dns options --fallback=false

Although I would recommend running ha resolution info first to make sure that supervisor did not identify any issues with your configured DNS servers. These checks run on a schedule, so if you want to be sure the check ran you can either wait a couple of hours or force it to run immediately with ha resolution healthcheck.
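Putting it together, a safe sequence (all commands covered above) looks roughly like this:

# force the DNS checks to run now, then review the results
ha resolution healthcheck
ha resolution info

# if no DNS issues are reported, disable the fallback
ha dns options --fallback=false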

One minor catch: the DNS plugin also needs an update to acknowledge and support this new option. We’re going to wait about 8 hours from now to merge https://github.com/home-assistant/version/pull/223 so that supervisor has updated everywhere first. Once this second PR goes in and your DNS plugin updates, you’re free to disable the fallback DNS if you want.

8 Likes

I also have the issue that .local addresses can’t be resolved since yesterday (probably an HA update).
Running Home Assistant 2022.04.01 with CoreDNS 1.8.4.

I have a FritzBox that’s set up to use a separate RPi with Pi-hole.
Whenever I want to use, for example, tv.local in Home Assistant, no request is made to Pi-hole and it just times out. When I change to tv.de it all works fine.

I opened an issue here:

I tried using ha dns options --fallback=false but it didn’t help.

I’m using HA OS and updated to 2022.5.0 this morning, as well as Supervisor 2022.05.0. I wasn’t having any issues using local hostnames in the configuration prior to today. I tried going back to my backup of 2022.04.07 but am still having issues, so I don’t think it is specifically 2022.5.0 that is the problem.

I have an AdGuard Home LXC running on the network with DNS rewrites for all the local hostnames, and like I said, that has been working fine until today.

I have gone through all the sites I can find looking for a solution. I even added the DNSMasq add-on to see if HA would work better with that. I checked ha dns info, and both local and servers include the 192. address for the AdGuard Home server. I also set fallback to false. I still could not get configs to work with hostnames, so I installed tcpdump on HA OS, filtered on UDP port 53 (see the sketch after the list), and noticed a couple of things:

  1. If I run nslookup mosquitto.jnetinc.local, it fails, and tcpdump shows traffic between core-ssh.local.hass.io.39905 > hassio_dns.hassio.53
  2. If I run nslookup mosquitto.jnetinc.local 192.168.X.X using the AdGuard Home IP, I get the appropriate response, and tcpdump shows the traffic between core-ssh.local.hass.io and 192.168.X.X
  3. If I go into the HA dashboard and configure MQTT using mosquitto.jnetinc.local as the hostname, it fails to connect, and tcpdump doesn’t show any traffic at all, so it’s like it’s not even asking
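For anyone wanting to reproduce, the capture was along these lines (run from the HA OS shell; the exact flags are my own choice, nothing HA-specific):

# watch DNS traffic on all interfaces without resolving names
tcpdump -i any -n udp port 53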

Am I missing something?

In AdGuard Home, did you add a DNS rewrite for mosquitto.jnetinc.local?

Yes, they have been there for a while, and I was using the hostname in the MQTT configuration without an issue for several months. It just stopped working today when I restarted Core as part of the 2022.5.0 update. All configs using local hostnames stopped working. It’s possible to use IP addresses for some, but that’s messy, and there are a few that use certificates that fail because they are tied to the hostname.


1 Like

Yes, so that actually won’t work anymore. People submitted CVE-2020-36517 saying that we were forwarding local domain names inappropriately, so we did an audit. One of the things we realized was that we were in fact handling .local poorly. This was changed in this PR:

We looked at how systemd-resolved handles LLMNR and mDNS names and noticed that it flatly refuses to forward these types of queries to resolvers, as those names are reserved for LLMNR and mDNS; you can see that code here. So now we do this as well.

If you need a DNS rewrite for .local, then something is not working properly in your network. .local is exclusively for mDNS, and DNS resolvers are not supposed to answer queries for those names. Something either isn’t broadcasting mDNS correctly or is handling mDNS queries incorrectly.

Alternatively, you can switch to a domain like .lan or one of the other TLDs reserved for local use. But not .local, since that is reserved for multicast only.
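As a quick sanity check you can ask systemd-resolved directly from the host shell, the same way as the earlier homeassistant.local example; a healthy mDNS setup should answer without your DNS server ever being involved (tv.local is just the example hostname from the earlier post):

# resolvectl query tv.local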

Using mosquitto.jnetinc.home.arpa and adding that to the DNS rewrites fixed the issue. I never knew that about .local. Now to start the process of updating my self-signed certs and configs…

1 Like