I have been migrating a bunch of services from various LXCs and VMs into a single Docker Swarm, and when I moved InfluxDB2 from an LXC to the swarm (behind Traefik + TLS), HAOS suddenly stopped resolving the domain. Which doesn’t make a lot of sense. The problem isn’t with InfluxDB2 - my Proxmox cluster is happily pushing data into the new instance. The (internal) DNS entry changed what it was pointing at, but the name itself is the same.
> ha dns info
fallback: true
host: 172.30.32.3
llmnr: true
locals:
- dns://10.76.0.2
- dns://10.76.0.3
mdns: true
servers: []
update_available: false
version: 2023.06.2
version_latest: 2023.06.2
> curl -v https://influxdb.local.xxxx.xxxx
* Could not resolve host: influxdb.local.xxxx.xxxx
* Closing connection
curl: (6) Could not resolve host: influxdb.local.xxxx.xxxx
And
Logger: homeassistant.components.influxdb
Source: components/influxdb/__init__.py:487
Integration: InfluxDB (documentation, issues)
First occurred: 12:56:43 AM (575 occurrences)
Last logged: 10:30:51 AM
Cannot connect to InfluxDB due to '<urllib3.connection.HTTPSConnection object at 0x7fdd35858f10>: Failed to establish a new connection: [Errno -2] Name does not resolve'. Please check that the provided connection details (host, port, etc.) are correct and that your InfluxDB server is running and accessible. Retrying in 60 seconds.
Cannot connect to InfluxDB due to '<urllib3.connection.HTTPSConnection object at 0x7fdd3384c150>: Failed to establish a new connection: [Errno -2] Name does not resolve'. Please check that the provided connection details (host, port, etc.) are correct and that your InfluxDB server is running and accessible. Retrying in 60 seconds.
Cannot connect to InfluxDB due to '<urllib3.connection.HTTPSConnection object at 0x7fdd30766950>: Failed to establish a new connection: [Errno -2] Name does not resolve'. Please check that the provided connection details (host, port, etc.) are correct and that your InfluxDB server is running and accessible. Retrying in 60 seconds.
Cannot connect to InfluxDB due to '<urllib3.connection.HTTPSConnection object at 0x7fdd310ff350>: Failed to establish a new connection: [Errno -2] Name does not resolve'. Please check that the provided connection details (host, port, etc.) are correct and that your InfluxDB server is running and accessible. Retrying in 60 seconds.
Cannot connect to InfluxDB due to '<urllib3.connection.HTTPSConnection object at 0x7fdd4b0754d0>: Failed to establish a new connection: [Errno -2] Name does not resolve'. Please check that the provided connection details (host, port, etc.) are correct and that your InfluxDB server is running and accessible. Retrying in 60 seconds.
Any other host on the LAN can locate and connect. I’m very confused here, and it seems clear (to me) that it’s a problem inside ha dns. I don’t understand why changing an A record to a CNAME would cause this unless stuff was cached, but even then I’d expect a timeout to the old IP, not a DNS resolution failure. Restarting DNS didn’t resolve the issue, restarting HA didn’t resolve the issue, and rebooting the VM … didn’t resolve the issue.
Thought I’d make sure the scope of the problem was what I thought, and got a big surprise:
> curl -v http://dns-prod-1.local.xxxx.xxxx
* Trying 10.76.0.2:80...
* Connected to dns-prod-1.local.xxxx.xxxx (10.76.0.2) port 80
.....
Weird. Let’s try another CNAME record.
> curl -v https://dashy.local.xxxx.xxxx
* Could not resolve host: dashy.local.xxxx.xxxx
* Closing connection
curl: (6) Could not resolve host: dashy.local.xxxx.xxxx
Fascinating.
What happens if I change the CNAME for InfluxDB to an A record? After giving it a few minutes to clear any cache:
> curl -v https://influxdb.local.xxxx.xxxx
* Trying: 10.76.0.20:443
* Connected to influxdb.local.****.**** (10.76.0.20) port 443
....
OK, testing an external CNAME works. So… is this some weirdness from AdGuard Home? Running dig against an internal name (“portainer.local.xxxx.xxxx”) and an external one (“bookstack.xxxx.xxxx”) shows no discernible difference, despite the first being answered by AdGuard Home and the second by Cloudflare.
I’m out of ideas. Changing the DNS record to an A record fixed the initial problem (HA failing to push data to InfluxDB2 - it is now), but it’s a crummy workaround. I’d like to actually fix the problem if possible.
I have not fixed it. In fact, I have discovered a similar issue elsewhere in my stack (a Docker container that cannot resolve an internal hostname). The only notable difference is that nslookup returns both the CNAME information and an NXDOMAIN. I’m on the hunt to see if I can run this down. It’s especially weird because the Portainer container can connect to this host just fine. But this other container can’t.
(edit: the NXDOMAIN is actually the AAAA/IPv6 lookup. Not sure it’s related. Trying to characterize the overall scope of this issue. For my setup, Proxmox is a layer in the system as well.)
I’m narrowing in, and I think it might be an AdGuard Home issue. It appears AAAA (IPv6) requests are being forwarded and are returning NXDOMAIN because, well, there is no upstream record for any *.local.xxxx.xxxx domain.
I’m not quite sure why the behavior is different between VMs and inside Docker - but that does appear to be the root cause, because HAOS DNS is running inside Docker.
AAAA queries for external CNAME records return no answer, but AAAA queries for internal CNAME records are forwarded to the upstream DNS servers, which return NXDOMAIN. Trying to see if I can change the behavior in AdGuard Home to address this, or if there is a Docker setting I can change.
OK, this is absolutely a DNS issue with AdGuard Home.
Behavior: when resolving a locally defined host through a CNAME record, two lookups are generated: an A query for the host being pointed at (which succeeds), and an AAAA (IPv6) query for the original name, which is forwarded to the upstream resolvers and comes back NXDOMAIN. This behavior is consistent inside Docker, but for some reason not on the VM hosts.
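For anyone who wants to check whether they are hitting the same thing, here is a minimal Python sketch that asks the system resolver for a name three ways: IPv4 only, IPv6 only, and both together. The function name resolve_by_family is made up for this example, and whether the "both" lookup actually fails depends on the resolver library in the container or VM where you run it, so treat it as a diagnostic aid rather than proof.

```python
import socket
import sys


def resolve_by_family(host):
    """Resolve host via the system resolver, once per address family.

    Returns a dict with keys "A", "AAAA" and "both". Each value is either a
    sorted list of addresses or a string starting with "error:".
    """
    families = (
        ("A", socket.AF_INET),        # IPv4 only
        ("AAAA", socket.AF_INET6),    # IPv6 only
        ("both", socket.AF_UNSPEC),   # what most clients ask for by default
    )
    results = {}
    for label, family in families:
        try:
            infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
            # info[4] is the sockaddr tuple; its first element is the address.
            results[label] = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            results[label] = f"error: {exc}"
    return results


if __name__ == "__main__":
    # Usage: python3 resolve_check.py influxdb.local.xxxx.xxxx
    for name in sys.argv[1:]:
        print(name, resolve_by_family(name))
```

If "A" returns an address while "both" reports an error, that would be consistent with the AAAA answer dragging down the whole lookup.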
Fix: if you do not use IPv6, you can disable IPv6 resolution by checking the checkbox “disable resolving of IPv6 addresses” in the DNS settings of AdGuard Home. AdGuard Home will then just return NOERROR for AAAA queries rather than passing them on. Presumably, if you do use IPv6, you will also add the appropriate IPv6 entries in AdGuard Home.
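Why would NOERROR help when NXDOMAIN does not? NXDOMAIN says the name does not exist at all, while NOERROR with an empty answer only says there is no record of that type. The toy model below shows how a strict client could combine the two answers. It is an assumption about how the resolver inside the container behaves, not something confirmed in this thread, and combined_lookup is a hypothetical name used only for illustration.

```python
def combined_lookup(a_rcode, a_addrs, aaaa_rcode, aaaa_addrs):
    """Toy model of a strict stub resolver combining A and AAAA answers.

    Assumption (not confirmed for any specific resolver): an NXDOMAIN on
    either query is taken to mean the name does not exist, so the whole
    lookup fails even if the other query returned addresses.
    """
    if "NXDOMAIN" in (a_rcode, aaaa_rcode):
        raise LookupError("Name does not resolve")
    addrs = list(a_addrs) + list(aaaa_addrs)
    if not addrs:
        raise LookupError("No address records")
    return addrs


if __name__ == "__main__":
    # After the fix: the AAAA query gets NOERROR with no records.
    print(combined_lookup("NOERROR", ["10.76.0.20"], "NOERROR", []))

    # Before the fix: the AAAA query is forwarded upstream and gets NXDOMAIN.
    try:
        combined_lookup("NOERROR", ["10.76.0.20"], "NXDOMAIN", [])
    except LookupError as exc:
        print("lookup failed:", exc)
```

Under this model the A record is still there before the fix, but the client never gets to use it.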