IPv6 issue: HAOS unreachable over IPv6 on LAN after ~256 seconds

I’m frequently finding that my HA Yellow can’t be reached over IPv6 from multiple hosts on the LAN, all in the same IPv6 range auto-assigned by my router. Running Wireshark on an affected client shows it sending Neighbor Solicitation packets and not receiving a reply. IPv4 works fine, and none of these clients appear to have any firewall configuration that would affect this. The router, and external traffic coming through it, can reach the Yellow just fine over IPv6 or IPv4.

Workarounds:

  • When I go to Settings / System / Network / IPv6 and set it to Disabled and then back to Automatic, HA is suddenly reachable on IPv6 again for a while (possibly until the client reboots?).

  • Initiating IPv6 contact from HA to another machine, e.g. with ping from the Advanced SSH add-on, will also make that machine able to reach HA over IPv6.

  • ip neigh add ... permanent on the client also seems to fix it, but that’s not available on ChromeOS, Android, etc.

  • Adding a static route on the client to the Yellow’s address specifically, via the router (instead of direct to the device), also works. (Rough sketches of these last two workarounds are just below.)
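
For reference, those last two look roughly like this on a Linux client (the address, MAC and interface name below are placeholders, not real values):

# pin a permanent neighbour entry for the Yellow, so no solicitation is needed
ip -6 neigh add <yellow-ipv6> lladdr <yellow-mac> dev eth0 nud permanent

# or send traffic for the Yellow’s address via the router instead of direct to the device
ip -6 route replace <yellow-ipv6>/128 via <router-ipv6> dev eth0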

Core: 2024.4.2
Supervisor: 2024.04.0
Operating System: 12.1
Frontend: 20240404.1

Has anyone seen anything similar, or got any ideas? (other than doing without IPv6)

Hi,
Almost:

Personally, I tend to see the reverse more often with the HASS web i/f - if AdGuard IPv4 DNS falls over, Linux clients connect over IPv6 fine via Firefox, but other IPv4-only stuff fails.

I’ve occasionally seen Linux IPv4 mDNS advertisement fail (Fedora hostname screw-up), needing a quick sudo systemctl restart avahi-daemon on the non-HASS server.


Yellow, HAOS 12.1 (about to be 12.2), static IPv4, auto IPv6 (both 2a00: & fe80: LL), dual-stack WAN router, AdGuard DNS, Fedora Linux + Android clients, Tasmota / ESPHome / homebrew MQTT.
TTFN,

James

1 Like

Not sure, but it sorta/kinda sounds like the client is on the same LAN but is on a different IPv6 network.

Some things to consider/look at (you may have already tried some of this):

  • Is the client trying to reach HA by name? If yes, what IPv6 address did the name resolve to; if not, what address did it use to reach HA?
  • What source address is the client using when trying to reach HA?
  • When SSH’d into HA, enter ip -6 addr and look to see what IPv6 addresses are assigned to your HA’s Ethernet interface (interface name something like enp0s3). You should see something like:
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 fdaf:77aa:4dae:faef:7eb6:7c01:5d03:c9b6/64 scope global dynamic noprefixroute 
       valid_lft 1638sec preferred_lft 1638sec
    inet6 fe80::67e2:ae44:5824:8498/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

The first address (scope global dynamic noprefixroute) in this example is based on a “Router Advertisement” prefix your router sent to HA, and HA automatically derived its own address from this prefix. If your router is not sending a Router Advertisement, that’s OK, and you won’t see an address like this. If you do, does the client use this same prefix when trying to reach HA? If yes, note that prefixes have lifetimes, and maybe one expired on HA or on the client but not on the other.

The second address (scope link noprefixroute) is known as “Link Local” (it starts with fe80) and all IPv6 endpoints have one. However, link locals are not routable. When the client tried to contact HA, did it use a Link Local address?

Finally, take a look at the route table in HAOS (ip -6 route) and see what routes are assigned to HA’s Ethernet interface.
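
If it helps with the above, two quick client-side checks (a sketch; the address and interface name here are just examples):

# which route and source address the client would pick for HA’s address
ip -6 route get 2001:db8::1234

# the prefixes and lifetimes the client has learned from Router Advertisements
ip -6 addr show dev eth0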

2 Likes

For privacy let’s use some fictitious addresses with the same relationship as my real ones:

                        lladdr              IPv6
Home Assistant Yellow   11:22:33:44:55:66   2001:8b0:aaaa:bbbb:1322:33ff:fe44:5566
bifrost (a client)      88:77:66:55:44:33   2001:8b0:aaaa:bbbb:8a77:66ff:fe55:4433
router                  dd:cc:bb:aa:99:88   2001:8b0:aaaa:bbbb::1 (and others externally)

…where the prefix 2001:8b0:aaaa:bbbb::/64 is advertised to both machines by my router (a Technicolor DGA0122 connected to Andrews & Arnold).

So I don’t think that’s two networks, unless I’m missing something, but to answer the questions in full:

  • I normally use a DNS name with AAAA = 2001:8b0:aaaa:bbbb:1322:33ff:fe44:5566 and A = the public IPv4 of my router (which has a port-forward), but using the IPv6 address directly gives the same failure (with curl -6 or ping6). If I visit the DNS name in a browser, it falls back to v4 and works.
  • The client is using the expected 2001:… address. tcpdump shows the following unanswered neighbor solicitations, and it doesn’t get as far as sending any other v6 traffic to the Yellow after this fails:
23:08:41.269218 IP6 2001:8b0:aaaa:bbbb:8a77:66ff:fe55:4433 > ff02::1:ff44:5566: ICMP6, neighbor solicitation, who has 2001:8b0:aaaa:bbbb:1322:33ff:fe44:5566, length 32
23:08:42.280237 IP6 2001:8b0:aaaa:bbbb:8a77:66ff:fe55:4433 > ff02::1:ff44:5566: ICMP6, neighbor solicitation, who has 2001:8b0:aaaa:bbbb:1322:33ff:fe44:5566, length 32
23:08:43.304233 IP6 2001:8b0:aaaa:bbbb:8a77:66ff:fe55:4433 > ff02::1:ff44:5566: ICMP6, neighbor solicitation, who has 2001:8b0:aaaa:bbbb:1322:33ff:fe44:5566, length 32
  • Here’s the relevant interface on Yellow:
2: end0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 2001:8b0:aaaa:bbbb:1322:33ff:fe44:5566/64 scope global dynamic noprefixroute 
       valid_lft 6729sec preferred_lft 6729sec
    inet6 fe80::1322:33ff:fe44:5566/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

…and here are the routes:

2001:8b0:aaaa:bbbb::/64 dev end0  metric 100 
2001:8b0:aaaa:bbbb::/64 via fe80::dfcc:bbff:feaa:9988 dev end0  metric 105 
fd0f:d185:67a:1::/64 via fe80::c63a:8cc0:8f03:ba6d dev end0  metric 100 
fe80::/64 dev hassio  metric 256 
fe80::/64 dev veth54e17ae  metric 256 
fe80::/64 dev vethdec9a02  metric 256 
fe80::/64 dev docker0  metric 256 
fe80::/64 dev veth2974160  metric 256 
fe80::/64 dev vethde69a99  metric 256 
[...]
fe80::/64 dev end0  metric 1024 
default via fe80::dfcc:bbff:feaa:9988 dev end0  metric 100 
anycast 2001:8b0:aaaa:bbbb:: dev end0  metric 0 
anycast fe80:: dev hassio  metric 0 
anycast fe80:: dev veth54e17ae  metric 0 
anycast fe80:: dev vethdec9a02  metric 0 
anycast fe80:: dev docker0  metric 0 
anycast fe80:: dev veth2974160  metric 0 
anycast fe80:: dev end0  metric 0 
anycast fe80:: dev vethde69a99  metric 0 
anycast fe80:: dev veth7f46591  metric 0 
[...]
multicast ff00::/8 dev hassio  metric 256 
multicast ff00::/8 dev veth54e17ae  metric 256 
multicast ff00::/8 dev vethdec9a02  metric 256 
multicast ff00::/8 dev docker0  metric 256 
multicast ff00::/8 dev veth2974160  metric 256 
multicast ff00::/8 dev end0  metric 256 
multicast ff00::/8 dev vethde69a99  metric 256 
multicast ff00::/8 dev veth7f46591  metric 256 
[...]

I’ve trimmed most of the “veth” lines; I seem to have about 14 such interfaces (1 per add-on?). I think the fd0f address is for Tailscale.

1 Like

Question: which box were the routes above captured from? (Sorry, a lot of data to look at; I just want to make sure I know the frame of reference.)

I haven’t quite figured out where or why yet, but one of these boxes is forgetting the cached IPv6 resolution for some reason.

(My bet is that HA stopped advertising it, but that requires proof.)

The indication that this is the case is that when you bounce the IPv6 stack in HA, it suddenly works - it registered that local advertisement when the IPv6 stack came up… Reinforced by the fact that adding a route manually makes it work.

Where you added the route manually, the client doesn’t otherwise have it because it’s failing to be refreshed for some reason. AFAIK HAOS’s IPv6 stack should be responsible for that.
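
One way to check that next time it happens (a sketch, assuming a Linux client; address and interface are examples): look at the state of the client’s neighbour-cache entry for HA:

# FAILED/INCOMPLETE means solicitations went unanswered; REACHABLE/STALE means a MAC is cached
ip -6 neigh show to 2001:db8::1234 dev eth0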

1 Like

I guess I need to do some more research on the anycast entries, but otherwise nothing in particular stands out, as it does indeed look like they are all on the same IPv6 network.

Another thing to consider is that neighbor solicitations use multicast, and if you have IGMP snooping turned on in L2 switches, there may be timeouts in those switches that end up blocking these multicasts.
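
One way to check whether the multicast is actually being dropped (just a sketch; assumes tcpdump can be run on or alongside HA, e.g. from the SSH add-on): capture on HA’s LAN interface while a failing client tries to connect. If the client’s Neighbor Solicitations never show up, something in the L2 path is eating them; if they show up and HA doesn’t answer, the problem is on HA’s side:

# Neighbor Solicitations are ICMPv6 type 135; ip6[40] is the ICMPv6 type byte
# (valid here because NS packets carry no extension headers)
tcpdump -ni end0 'icmp6 and ip6[40] == 135'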

2 Likes

That was from ip -6 r s on HAOS

I don’t know much about IGMP/Multicast specifically, but I’ve now reproduced this both when the client is ethernet-only and when the client is wifi-only, with HA temporarily connected directly to the router, so the router, a Technicolor DGA0122, is the only equipment that’s always been between client and HA. (Usually there is a Netgear GS108 switch between HA and the router.)

I’ve also run an experiment to check the timing: I turned IPv6 off and on in HA, started a stopwatch, and then repeatedly brought a client’s ethernet down and up (thus clearing IPv6 neighbours each time). On every occasion up to and including about 249 seconds on the stopwatch, it could ping6 HA. From about 270 seconds onwards, it could not.
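
(For anyone wanting to repeat the client side of that test, something like this does it - using the fictitious Yellow address from above, eth0 as an example interface name, and run as root:)

#!/bin/sh
# bounce the NIC (clearing IPv6 neighbours), wait for SLAAC/DAD, then try one ping6 to HA,
# printing the elapsed time since the script started
start=$(date +%s)
while true; do
    ip link set eth0 down
    sleep 2
    ip link set eth0 up
    sleep 8    # give SLAAC/DAD a moment to finish
    if ping6 -c 1 -W 2 2001:8b0:aaaa:bbbb:1322:33ff:fe44:5566 >/dev/null 2>&1; then
        echo "$(( $(date +%s) - start ))s: OK"
    else
        echo "$(( $(date +%s) - start ))s: FAIL"
    fi
done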

(Edit: and I’ve previously established that a client that successfully makes contact before that cut-off time can then continue to contact HA, at least for a while - I notice the entry in ip -6 neigh still has the lladdr rather than lacking it and showing FAILED.)

And I don’t think I’ve mentioned this yet: two other IPv6 hosts on the network (besides HA) can ping6 each other after being up for much longer than that.

Putting all that together, my current theory is that something changes state after HA’s IPv6 has been up for perhaps 256 seconds, either in HA itself or in what the router thinks about HA specifically.

The router also has a switch to turn IPv6 off, and I’ve found that turning that off and on does not make things work again. I’ll try a full router reboot later.

1 Like

I have a vague recollection of reported issues with IGMP snooping on early Unifi LAN switches / WLAN APs and IoT kit, but thankfully both are in use here with IPv6 + IPv4 without issues or specific settings.

A quick search for Unifi IGMP issues only brought up multicast video-streaming control issues, not IPv6 DHCP / DNS / mDNS discovery, and I’ve not had to read the RFCs (RFC 6762?) as stuff largely works. (No hits for a 256-second timer in the RFC - 120 seconds only.)

I know - not helpful at all, sorry!

Do you have a dumb-ish L2 switch to try a test? Managed switches with span ports are nice for pcap, and ISP kit is cheap to obtain, but sometimes the firmware extras can be the problem…

1 Like

Somewhat from memory, and sorry for being long-winded.

The way IGMP/MLD works (note: IGMP is for IPv4, MLD is for IPv6), which by the way is a Layer 3 operation, is roughly this: an MLD client (in this case HA) that wants to receive a particular multicast address - in this case the “solicited-node multicast” address, which is what a Neighbor Solicitation is sent to when asking “who has this IP” - sends an MLD “Join” out over its local network saying it wants to join that solicited-node multicast group (and thus be able to receive NS messages). A neighbor wanting to know HA’s MAC address for a given IPv6 address then sends a Neighbor Solicitation to that solicited-node multicast group address, and HA can receive it.

However, if a Layer 2 switch is in the path and has IGMP/MLD snooping enabled, it will block all multicast until it receives a “join” for that multicast group from a client on switch port P; the snooping switch then adds an entry to its Layer 2 switch table allowing any message for that multicast group to go out port P. After a while, if the switch does not receive another “join” from that same client, it times out and removes the entry. MLD clients send a join when their interface comes up, but do not send periodic joins on their own. In normal IGMP/MLD, there is also a Layer 3 multicast router configured on the LAN which periodically sends MLD “Query” messages; when an MLD client receives one, it responds by sending a “join” for all the multicast addresses it wishes to continue receiving. The Layer 2 switch sees this join (again) and refreshes its timeout timer.

In many cases there is no router in the home configured for L3 multicast, so nothing sends out a Query periodically; consequently the client does not send any more joins, and consequently the L2 switch times out the entry. IGMP/MLD snooping is somewhat non-standardized, but many implementations follow the same way of doing things.

Looking at a couple of implementations, the timeout appears to be around 250 to 260 seconds.

Note too that when L2 IGMP/MLD snooping is turned off, multicast traffic is simply flooded by the L2 switch, just like broadcast traffic.
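
If it’s useful: you can see which multicast groups HA has actually joined (including the solicited-node group that Neighbor Solicitations for its address are sent to) with:

# list IPv6 multicast memberships on HA’s LAN interface;
# the solicited-node group looks like ff02::1:ffXX:XXXX (last 24 bits of the unicast address)
ip -6 maddr show dev end0

If that group is still listed while neighbors can’t reach HA, then HA hasn’t dropped its membership locally, which would point at state elsewhere (e.g. a snooping switch timing out) rather than HA itself.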

3 Likes

Thanks both for the continued help.

  • All my switches are unmanaged (unless you count the router itself I guess) and none of them, nor the router, have any settings about IGMP/MLD/snooping.
  • I haven’t tried every switch as some of them aren’t convenient to move, but I’ve tried connecting a client and HA to the same switch, such that the router isn’t between them as such, and it didn’t seem to help (Edit: not sure about that after a repeat experiment.)
  • I re-tested connections between combinations of other IPv6 devices around the house (some wifi, some ethernet). Most worked, but curiously there’s one other host that shows similar symptoms (unreachable on IPv6 unless this host has established contact first). It’s a Raspberry Pi 2 model B on Raspbian 11.
  • @wmaker This idea of a Query sounds promising… do you know any CLI tool that can send one from Linux?

No, I don’t know of any tool… but if you think IGMP snooping is not being used, then there must be some other problem anyway.

Just to add one more workaround: a client that sends a suitable ping to the all-nodes multicast address can then see HA and the other affected server. I guess this works similarly to when an affected server initiates contact by any other method:

ping6 -I 2001:8b0:aaaa:bbbb:8a77:66ff:fe55:4433 ff02::1%enp6s0

But not all clients can easily do that, and pinging similarly from the affected-server end doesn’t have the same effect, alas.

What I still hope to find, at minimum, is any kind of workaround I can implement on HA (and perhaps the affected Pi) that works for all clients, without having to list them and without taking IPv6 down and up periodically.