Improve Privacy, Stop using hardcoded DNS

btasker · November 5, 2021, 7:00am

Yep it should do.

Under the hood what it does is periodically go into the HomeAssistant DNS container and check what config coredns has. If it’s not the “approved” version (i.e. has reverted to the official version) it’ll change the config and restart the coredns process to make it use the new config.

There’s also a mode that should prevent supervisor from auto-updating and quietly breaking your install (trying to fix that originally led me to this DNS issue).

There are 2 modes, depending on your preference:

- auto-patch (the default) just removes any reference to 127.0.0.1:5553 (which is where the fallback listens)
- template applies a config that you provide in /config

auto-patch currently probably works best for most, but it’s possible that may change in future HomeAssistant releases.
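To make the auto-patch idea concrete, here is a minimal Python sketch of what "remove any reference to 127.0.0.1:5553" could look like when applied to a Corefile held as text. It is not the add-on's actual code. The `SAMPLE` fragment is a hypothetical, simplified Corefile written for illustration, and the function names `auto_patch` and `needs_patch` are assumptions.

```python
# Address where the fallback listens, per the description above.
FALLBACK_ADDR = "127.0.0.1:5553"


def needs_patch(corefile: str) -> bool:
    """True if the config still references the fallback listener."""
    return FALLBACK_ADDR in corefile


def auto_patch(corefile: str) -> str:
    """Return the config text with every reference to the fallback removed.

    Assumption: a directive named `fallback` exists only to point at the
    fallback listener, so the whole line is dropped. On any other line,
    only the token naming the fallback address is removed.
    """
    patched = []
    for line in corefile.splitlines():
        if FALLBACK_ADDR not in line:
            patched.append(line)
            continue
        tokens = line.split()
        if tokens[0] == "fallback":
            continue  # drop the whole directive
        indent = line[: len(line) - len(line.lstrip())]
        kept = [t for t in tokens if FALLBACK_ADDR not in t]
        patched.append(indent + " ".join(kept))
    return "\n".join(patched) + "\n"


# Hypothetical, simplified Corefile fragment (not copied from a real install).
SAMPLE = """\
.:53 {
    forward . dns://192.168.1.1 dns://127.0.0.1:5553 {
        policy sequential
        health_check 1m
    }
    fallback REFUSED,SERVFAIL,NXDOMAIN . dns://127.0.0.1:5553
    cache 600
}
"""

if __name__ == "__main__":
    if needs_patch(SAMPLE):
        print(auto_patch(SAMPLE))
```

A periodic job would call `needs_patch` on the live config and only rewrite and restart when it returns true, which keeps restarts to the cases where the official config has come back.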

Take with a pinch of salt, but my installation definitely feels more responsive with it installed - Cloudflare’s DNS is slow compared to my local resolver (partly because it’s more distant). I used to assume that the ewelink API was slow; turns out it was probably the name resolution.

Stooovie · November 5, 2021, 7:46am

Thanks for the detailed answer! Excuse this possibly silly question, but can this (local operation of HA) be tested by just yanking the WAN cable out of my router? It should, right?

btasker · November 5, 2021, 8:22am

I’d have thought so, yeah.

What you observe will probably depend on how your local DNS (router I assume) handles not being able to reach an upstream. Some sit and time out (so you’d still see delays), others fail early (so it’ll be quite responsive, but obviously stuff relying on external name resolution will be broken).

CoreDNS uses a default dial timeout of 30s, so with the official version you’ve got quite a big delay in there when the WAN is down (calling scripts might use a shorter timeout - I haven’t looked, but I suspect not).

The timeout is built into CoreDNS and isn’t configurable AFAIK - yet another reason why baking CoreDNS into HomeAssistant is a bizarre choice: it’s most commonly seen in Kubernetes clusters, not appliances (there are much better suited options if local resolution is a must).

Let me know how you get on, it’s possible I may be able to find a way around it if it proves to be a sticking point

btasker · November 6, 2021, 5:15pm

For reference

The… uh… assault on your network if you block Cloudflare at the edge is the result of a misconfiguration in the HomeAssistant CoreDNS config. I’ve raised a bug here: CoreDNS is misconfigured leading to unexpected healthcheck behaviour · Issue #64 · home-assistant/plugin-dns · GitHub

Basically, because that fallback is in the main forward statement, it gets healthchecked once a minute. If you’ve blocked Cloudflare, then the upstream queries will fail and the :5553 block will retry them, slowly building up a nice little traffic storm.

It is, very much, a bug in the HA DNS setup

Stooovie · November 6, 2021, 6:28pm

I tried yanking the WAN cable out of the router (to simulate local-only operation) and most HA functionality was retained. Yeelight bulbs switched to LAN mode after a few seconds. Of course Tuya wouldn’t work but that’s expected. Even most Xiaomi stuff worked locally, which was unexpected. Not bad.

Stooovie · November 6, 2021, 6:31pm

Does DNS blocking with Adguard Home (running as a HA addon on the same Rpi4) come into play at all with this? I have set my router DHCP server’s primary DNS to the Pi’s IP, and the secondary to 8.8.8.8, so all devices are filtered by AGH and there’s a fallback if AGH isn’t running. Does this matter with the issue at hand at all?

tescophil · November 6, 2021, 8:24pm

Short answer, no. The fallback bypasses any local DNS settings on your network (that’s the problem)

tescophil · November 6, 2021, 8:27pm

@btasker Nice issue raised in GitHub, but I predict it will just be closed with no response (just like the last four issues I raised).

Fingers crossed I’m wrong.

btasker · November 6, 2021, 11:42pm

This is good news.

I captured some metrics (leading to the GH issue above) and found that use of Cloudflare added about half a second of latency for me (despite CF only being about 15ms away). I’ve been meaning to rewrite the fallback to use UDP to try and prove whether it’s DoT overhead, but got sidetracked on other things.
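For anyone wanting to get a rough feel for resolution latency on their own install, a minimal Python sketch follows. It is not how the metrics above were captured. It simply times lookups through whatever resolver the host is configured to use, so OS or resolver caching can make repeat lookups look faster than a cold one. The function names and the hostname in the usage section are placeholders.

```python
import socket
import statistics
import time


def time_lookups(hostname, attempts=5, resolve=socket.getaddrinfo):
    """Time `attempts` lookups of `hostname`, returning seconds per lookup.

    `resolve` defaults to the system resolver path; it is a parameter so a
    different lookup function can be swapped in.
    """
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            resolve(hostname, None)
        except OSError:
            # A failed lookup still costs wall-clock time, so it is recorded.
            pass
        samples.append(time.perf_counter() - start)
    return samples


def summarise_latency(samples):
    """Reduce a list of timings to min / median / max."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Placeholder hostname: substitute a name your integrations actually resolve.
    stats = summarise_latency(time_lookups("example.com"))
    for name, seconds in stats.items():
        print(f"{name}: {seconds * 1000:.1f} ms")
```

Running it once with the official config and once with the fallback removed should show whether the difference is anywhere near the half second mentioned above.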

Having adguard active shouldn’t come into play, no - as Phil says, the underlying problem is that the cloudflare fallback bypasses everything/anything you’ve set up on the LAN.

btasker · November 6, 2021, 11:45pm

Hopefully it’ll gain some traction, but I’m not going to hold my breath.

To be honest, if it doesn’t, then I’ve already got a path in mind.

I have a working PoC for replacing supervisor and coredns images with patched versions without triggering the codeNotary checks (so it won’t mark itself unhealthy/unsupported and block updates) - if push comes to shove that’ll allow me to patch out any bits they’re not willing to fix as and when they arise, and then I can get back to using HA as an appliance rather than a source of frustration

pk198105 · November 7, 2021, 4:25pm

What is so difficult about making this a configurable option? Let people decide; let them opt out of the DNS-over-TLS Cloudflare lookup if they don’t want it.

Djelibeybi · November 7, 2021, 8:11pm

I replaced the CoreDNS image with a custom-made dnsmasq container instead. Works great.

myhades · November 8, 2021, 3:11am

This is getting unacceptable.
My HA runs in a network where all traffic must go through my own DNS server to avoid DNS poisoning, so any other DNS requests are blocked or redirected.
I wasn’t aware of the situation before this post, and it explains A LOT about why HA’s been encountering some weird network issues.

PLEASE DON’T force people to use embedded DNS, because Google devices do and they’re such a pain in the ass.

btasker · November 8, 2021, 8:34am

Just watch your install doesn’t get marked Unsupported and Unhealthy as a result - it’ll prevent you from installing addon updates.

One of the things Supervisor does is list out running containers and check they’re “approved”. One of the conditions of a supervisor install is

The operating system is dedicated to running Home Assistant Supervised.

(from architecture/adr/0014-home-assistant-supervised.md at 3331ea0255844a46d2830f3d780672b37b5793ff · home-assistant/architecture · GitHub).

It feels like there’s a certain disconnect with reality in all this. I don’t mind not getting support where I’ve done something custom, I do mind not being able to update stuff though.

Djelibeybi · November 8, 2021, 12:26pm

Yes, this caught me when I first tried replacing it without any nuance. Supervisor went nuts trying to replace it but of course, as soon as it stopped the container, it lost name resolution and therein lies madness (and possibly ironically, the reason DNS is so tenaciously configured).

Since then I have prepared my replacement container such that it is no longer rejected by Supervisor, but it does take a little bit of a fandango during upgrades.

jspanitz · November 23, 2021, 8:11pm

This is fantastic. I did however have an issue trying to install:

Failed to install add-on

The command ‘/bin/bash -o pipefail -c apk add --no-cache docker’ returned a non-zero code: 1

From system log:
21-11-23 15:05:47 INFO (SyncWorker_6) [supervisor.docker.addon] Starting build for 68e874ae/aarch64-addon-coredns-fix:0.1.1
21-11-23 15:06:01 ERROR (SyncWorker_6) [supervisor.docker.addon] Can't build 68e874ae/aarch64-addon-coredns-fix:0.1.1: The command '/bin/bash -o pipefail -c apk add --no-cache docker' returned a non-zero code: 1
21-11-23 15:06:01 ERROR (SyncWorker_6) [supervisor.docker.addon] Build log:
Step 1/23 : ARG BUILD_FROM=ghcr.io/hassio-addons/base/amd64:10.0.1
Step 2/23 : FROM ${BUILD_FROM}
 ---> 77db0d03c09e
Step 3/23 : SHELL ["/bin/bash", "-o", "pipefail", "-c"]
 ---> Using cache
 ---> 2643c024c7ad
Step 4/23 : ENV TERM="xterm-256color"
 ---> Using cache
 ---> 330311bea649
Step 5/23 : ARG BUILD_ARCH=amd64
 ---> Using cache
 ---> cfccee773891
Step 6/23 : RUN apk add --no-cache docker
 ---> Running in 0c97ca0a26fa
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/aarch64/APKINDEX.tar.gz
WARNING: Ignoring https://dl-cdn.alpinelinux.org/alpine/v3.14/main: temporary error (try again later)

btasker · November 24, 2021, 8:55am

I think the Alpine repo’s been having some issues this week - I had an issue building another image the other day.

I should probably push a pre-built image to a registry though, there’s no real need to have to build it each time as we’re not customising anything based on the build host

jspanitz · November 27, 2021, 7:29pm

Yeah just tried again and same error - will keep trying until I forget

GaryK · November 27, 2021, 8:42pm

Supervisor log output:

21-11-27 13:38:18 INFO (MainThread) [supervisor.store.git] Cloning add-on https://github.com/bentasker/HomeAssistantAddons/ repository
21-11-27 13:39:06 ERROR (MainThread) [supervisor.store.git] Can't clone https://github.com/bentasker/HomeAssistantAddons/ repository: Cmd('git') failed due to: exit code(128)
  cmdline: git clone -v --recursive --depth=1 --shallow-submodules https://github.com/bentasker/HomeAssistantAddons/ /data/addons/git/68e874ae
  stderr: 'Cloning into '/data/addons/git/68e874ae'...
fatal: unable to access 'https://github.com/bentasker/HomeAssistantAddons/': The requested URL returned error: 504
'.
21-11-27 13:39:06 INFO (MainThread) [supervisor.resolution.module] Create new issue IssueType.FATAL_ERROR - ContextType.STORE / 68e874ae
21-11-27 13:39:06 INFO (MainThread) [supervisor.resolution.module] Create new suggestion SuggestionType.EXECUTE_REMOVE - ContextType.STORE / 68e874ae
21-11-27 13:39:06 ERROR (MainThread) [supervisor.store] Can't load data from repository https://github.com/bentasker/HomeAssistantAddons/

It appears GitHub has an issue today.

tescophil · January 8, 2022, 3:21pm

Not wanting to flog a dead horse, but I’ve raised yet another issue on GitHub about this, as the behaviour on start-up has changed with the latest version

github.com/home-assistant/supervisor

DNS Message Storm

opened 03:13PM - 08 Jan 22 UTC by tescophil (label: bug)

### Describe the issue you are experiencing

After updating to the latest version and rebooting the host machine, I see a storm of DNS messages to 1.1.1.1 and 1.0.0.1 at a rate of approximately 1,330 requests per minute. This persists for 10-12 minutes then stops. During this time, a 'normal' number of DNS(53) requests are also sent to the locally assigned DNS server, the system is fully functional, but shows an approximate doubling of CPU usage during this period.

![Screenshot from 2022-01-08 15-01-53](https://user-images.githubusercontent.com/59442445/148649166-dfc9acfc-0feb-46a0-8437-d4f4e78bfa78.png)

### What is the used version of the Supervisor?

supervisor-2021.12.2

### What type of installation are you running?

Home Assistant OS

### Which operating system are you running on?

Home Assistant Operating System

### What is the version of your installed operating system?

Home Assistant OS 7.1

### What version of Home Assistant Core is installed?

core-2021.12.8

### Steps to reproduce the issue

1) All port 853 DNS requests on the network are redirected to a local service. N.B. In previous versions, blocking 853 requests would result in a 'message storm', but redirecting these same requests would not.
2) Update and reboot the host

### Anything in the Supervisor logs that might be useful for us?

```txt
These are the DNS logs, which I think are unhelpful because they contain no timestamp information.
[INFO] 127.0.0.1:39215 - 9428 "NS IN . udp 17 false 512" NOERROR - 0 5.432606477s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:34895 - 59462 "NS IN . udp 17 false 512" NOERROR - 0 5.458035354s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:56739 - 46027 "NS IN . udp 17 false 512" NOERROR - 0 5.331660312s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:58808 - 46741 "NS IN . udp 17 false 512" NOERROR - 0 5.420126791s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:38816 - 4732 "NS IN . udp 17 false 512" NOERROR - 0 5.405921786s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:51994 - 23540 "NS IN . udp 17 false 512" NOERROR - 0 5.504404725s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:51276 - 50772 "NS IN . udp 17 false 512" NOERROR - 0 5.542411367s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:58576 - 56873 "NS IN . udp 17 false 512" NOERROR - 0 5.570220953s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:39111 - 59553 "NS IN . udp 17 false 512" NOERROR - 0 5.511247088s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:50327 - 60812 "NS IN . udp 17 false 512" NOERROR - 0 5.502660682s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:37019 - 7802 "NS IN . udp 17 false 512" NOERROR - 0 5.4931429099999995s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:35885 - 8511 "NS IN . udp 17 false 512" NOERROR - 0 5.464419785s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:49851 - 43931 "NS IN . udp 17 false 512" NOERROR - 0 5.392757149s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:50418 - 31856 "NS IN . udp 17 false 512" NOERROR - 0 5.4743177020000005s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:59344 - 43489 "NS IN . udp 17 false 512" NOERROR - 0 5.485432552s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:45905 - 12656 "NS IN . udp 17 false 512" NOERROR - 0 5.404956821s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:57442 - 55380 "NS IN . udp 17 false 512" NOERROR - 0 5.411243457s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:47079 - 42357 "NS IN . udp 17 false 512" NOERROR - 0 5.509683664s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:45440 - 9175 "NS IN . udp 17 false 512" NOERROR - 0 5.474020209s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:33783 - 60902 "NS IN . udp 17 false 512" NOERROR - 0 5.51442521s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:45153 - 24924 "NS IN . udp 17 false 512" NOERROR - 0 5.480785116s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:47433 - 6081 "NS IN . udp 17 false 512" NOERROR - 0 5.524554836s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:57140 - 486 "NS IN . udp 17 false 512" NOERROR - 0 5.42725042s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:36694 - 39252 "NS IN . udp 17 false 512" NOERROR - 0 5.565304869s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:40078 - 11967 "NS IN . udp 17 false 512" NOERROR - 0 5.602336224s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:43475 - 15053 "NS IN . udp 17 false 512" NOERROR - 0 5.644719862s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:60825 - 21094 "NS IN . udp 17 false 512" NOERROR - 0 5.578912966s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:40380 - 11105 "NS IN . udp 17 false 512" NOERROR - 0 5.534997791s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:33645 - 59755 "NS IN . udp 17 false 512" NOERROR - 0 5.457796136s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:46080 - 61189 "NS IN . udp 17 false 512" NOERROR - 0 5.377324234s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:56699 - 34585 "NS IN . udp 17 false 512" NOERROR - 0 5.3598043650000005s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:53016 - 30704 "NS IN . udp 17 false 512" NOERROR - 0 5.393816554s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:59192 - 19357 "NS IN . udp 17 false 512" NOERROR - 0 5.540754531s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:39000 - 57883 "NS IN . udp 17 false 512" NOERROR - 0 5.560830907s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:56889 - 52884 "NS IN . udp 17 false 512" NOERROR - 0 5.537498188s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:32886 - 53561 "NS IN . udp 17 false 512" NOERROR - 0 5.421969831s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:33900 - 60616 "NS IN . udp 17 false 512" NOERROR - 0 5.346150499s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:44171 - 15542 "NS IN . udp 17 false 512" NOERROR - 0 5.372345367s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:55933 - 11828 "NS IN . udp 17 false 512" NOERROR - 0 5.463023343s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:39654 - 35258 "NS IN . udp 17 false 512" NOERROR - 0 5.468075386s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:52962 - 64931 "NS IN . udp 17 false 512" NOERROR - 0 5.475450687s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:42763 - 40932 "NS IN . udp 17 false 512" NOERROR - 0 5.543031639s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:46030 - 24171 "NS IN . udp 17 false 512" NOERROR - 0 5.441059923s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:48671 - 39229 "NS IN . udp 17 false 512" NOERROR - 0 5.490991256s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:51896 - 51917 "NS IN . udp 17 false 512" NOERROR - 0 5.526528155s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:60556 - 56650 "NS IN . udp 17 false 512" NOERROR - 0 5.489073053s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:51071 - 35802 "NS IN . udp 17 false 512" NOERROR - 0 4.311122831s
[ERROR] plugin/errors: 2 . NS: x509: certificate is valid for *.mydomain.org, mydomain.org, not cloudflare-dns.com
[INFO] 127.0.0.1:43592 - 61309 "NS IN . udp 17 false 512" NOERROR - 0 1.328996582s
[ERROR] plugin/errors: 2 . NS: x509: certificate is valid for *.mydomain.org, mydomain.org, not cloudflare-dns.com
[INFO] 127.0.0.1:47269 - 991 "NS IN . udp 17 false 512" NOERROR - 0 2.8312739479999998s
[ERROR] plugin/errors: 2 . NS: x509: certificate is valid for *.mydomain.org, mydomain.org, not cloudflare-dns.com
[INFO] 127.0.0.1:51775 - 59057 "NS IN . udp 17 false 512" NOERROR - 0 0.070910355s
[ERROR] plugin/errors: 2 . NS: x509: certificate is valid for *.mydomain.org, mydomain.org, not cloudflare-dns.com
```

### Additional information

I understand that the CoreDNS component used in HA is configured to use these Cloudflare DNS servers as a fallback, however when these servers are unreachable (clearly, redirected DoT DNS requests will fail because of the certificate mismatch), it should not cause the flood of network traffic observed at startup, especially when the locally configured DNS services are functioning perfectly.
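Since the DNS logs quoted in the issue carry no timestamps, about the only thing that can be pulled out of them is a count of failures per upstream and per failure type. The following is a rough Python sketch of that tally, assuming the error messages keep the wording seen in the issue ("dial tcp <ip>:<port>: i/o timeout" and "x509: certificate is valid for"). The function name `summarise_errors` is made up for illustration, and the `SAMPLE` lines are copied from the log above.

```python
import re
from collections import Counter

# Patterns match the two error shapes that appear in the quoted log.
DIAL_TIMEOUT = re.compile(r"dial tcp (\d+\.\d+\.\d+\.\d+):(\d+): i/o timeout")
CERT_MISMATCH = re.compile(r"x509: certificate is valid for")


def summarise_errors(log_text):
    """Count dial timeouts per upstream and certificate mismatches overall.

    Works on the whole text rather than line by line, so it copes with logs
    that have been flattened onto a few long lines.
    """
    timeouts = Counter(
        f"{ip}:{port}" for ip, port in DIAL_TIMEOUT.findall(log_text)
    )
    cert_errors = len(CERT_MISMATCH.findall(log_text))
    return timeouts, cert_errors


SAMPLE = """\
[INFO] 127.0.0.1:39215 - 9428 "NS IN . udp 17 false 512" NOERROR - 0 5.432606477s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.1.1.1:853: i/o timeout
[INFO] 127.0.0.1:34895 - 59462 "NS IN . udp 17 false 512" NOERROR - 0 5.458035354s
[ERROR] plugin/errors: 2 . NS: dial tcp 1.0.0.1:853: i/o timeout
[INFO] 127.0.0.1:51071 - 35802 "NS IN . udp 17 false 512" NOERROR - 0 4.311122831s
[ERROR] plugin/errors: 2 . NS: x509: certificate is valid for *.mydomain.org, mydomain.org, not cloudflare-dns.com
"""

if __name__ == "__main__":
    timeouts, cert_errors = summarise_errors(SAMPLE)
    for upstream, count in sorted(timeouts.items()):
        print(f"{upstream}: {count} dial timeout(s)")
    print(f"certificate mismatches: {cert_errors}")
```

In the quoted log the timeouts come first and the certificate mismatches last, which would be consistent with port 853 being blocked at first and then redirected to the local service, though without timestamps that ordering is only suggestive.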