Troubleshooting mDNS timeouts for Thread/Matter Devices with Proxmox 8

I’m having “random” issues with HAOS and Thread/Matter on my network. At random times mDNS fails and the Matter server logs fill up with mDNS timeouts. Usually a full reboot of the HAOS VM, power cycling both ATVs, or rebooting all three devices resolves the issue. The logs then stay clean for a couple of days before the mDNS issues crop up again. Everything is on a single internal network. IGMP snooping is disabled on all physical switches, because turning it on caused issues with Proxmox when the “VLAN aware” setting on the bridge was checked.

Configuration:

-HAOS 2023.10.1 VM
-Proxmox 8 host - single physical NIC
-Firewalla Gold Plus router
-Two Apple TVs w/ Thread, tvOS 17
-Apple TVs use wired Ethernet, as does the Proxmox host
-No SkyConnect
-Everything on a single subnet
-IPv6 DHCP enabled on Firewalla
-QNAP and Netgear switches, IGMP snooping turned off
-Ruckus R650 WAPs

The mDNS issues seem “random”. Sometimes I reboot the HAOS VM and immediately after reboot the logs are filled with mDNS timeout issues: CHIP Error 0x00000032: Timeout

Then I reboot the HAOS VM again, and the Matter logs come up clean and life is good. It’s very frustrating. This issue seems to have gotten worse over the last couple of weeks. I know mDNS issues are likely network induced, but I’m not sure where to start digging to get to the root cause.
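One possible starting point is to check, from another machine on the same subnet, whether anything answers an mDNS query at the moment the timeouts appear. The following is only a minimal sketch using the Python standard library. It sends a one-shot query over IPv4 to the standard mDNS group 224.0.0.251 on port 5353; Matter itself relies mainly on IPv6, so a reply (or silence) here is a hint rather than proof. The function names `build_mdns_query` and `probe` are made up for illustration, and the default service name `_matter._tcp.local` can be swapped for `_meshcop._udp.local` to look for Thread border routers.

```python
import socket
import struct

MDNS_GROUP_V4 = "224.0.0.251"  # standard IPv4 mDNS multicast group
MDNS_PORT = 5353


def build_mdns_query(name: str, qtype: int = 12) -> bytes:
    """Build a one-question DNS query packet (qtype 12 = PTR)."""
    # Header: ID 0, flags 0, 1 question, 0 answer/authority/additional records.
    header = struct.pack("!6H", 0, 0, 1, 0, 0, 0)
    # Encode the name as length-prefixed labels ending in a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.strip(".").split(".")
    ) + b"\x00"
    # QTYPE followed by QCLASS 1 (IN).
    return header + qname + struct.pack("!2H", qtype, 1)


def probe(name: str = "_matter._tcp.local", timeout: float = 3.0) -> list[str]:
    """Send one query and return the source addresses of any replies."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    responders = []
    try:
        sock.sendto(build_mdns_query(name), (MDNS_GROUP_V4, MDNS_PORT))
        while True:
            try:
                _, addr = sock.recvfrom(9000)
            except socket.timeout:
                break  # nothing more arrived within the timeout
            responders.append(addr[0])
    finally:
        sock.close()
    return responders
```

Calling `probe()` while the Matter logs are clean and again while they show timeouts would at least show whether responders go quiet on the wire or whether the problem is confined to the HAOS VM.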

Reviewing the Thread diagnostics in HAOS, the JSON does show my two ATVs and the route to both. In case it’s a Proxmox issue, here are some sysctl dumps for the IPv6 settings. vmbr0 is the bridge and enp87s0 is the physical NIC. On the Ruckus APs I have the settings that convert multicast to unicast all disabled.

Any help would be greatly appreciated!

net.ipv6.neigh.vmbr0.anycast_delay = 100
net.ipv6.neigh.vmbr0.app_solicit = 0
net.ipv6.neigh.vmbr0.base_reachable_time_ms = 30000
net.ipv6.neigh.vmbr0.delay_first_probe_time = 5
net.ipv6.neigh.vmbr0.gc_stale_time = 60
net.ipv6.neigh.vmbr0.interval_probe_time_ms = 5000
net.ipv6.neigh.vmbr0.locktime = 0
net.ipv6.neigh.vmbr0.mcast_resolicit = 0
net.ipv6.neigh.vmbr0.mcast_solicit = 3
net.ipv6.neigh.vmbr0.proxy_delay = 80
net.ipv6.neigh.vmbr0.proxy_qlen = 64
net.ipv6.neigh.vmbr0.retrans_time_ms = 1000
net.ipv6.neigh.vmbr0.ucast_solicit = 3
net.ipv6.neigh.vmbr0.unres_qlen = 101
net.ipv6.neigh.vmbr0.unres_qlen_bytes = 212992
net.ipv6.conf.enp87s0.disable_ipv6 = 1
net.ipv6.conf.enp87s0.disable_policy = 0
net.ipv6.conf.enp87s0.drop_unicast_in_l2_multicast = 0
net.ipv6.conf.enp87s0.drop_unsolicited_na = 0
net.ipv6.conf.enp87s0.enhanced_dad = 1
net.ipv6.conf.enp87s0.force_mld_version = 0
net.ipv6.conf.enp87s0.force_tllao = 0
net.ipv6.conf.enp87s0.forwarding = 0
net.ipv6.conf.enp87s0.hop_limit = 64
net.ipv6.conf.enp87s0.ignore_routes_with_linkdown = 0
net.ipv6.conf.enp87s0.ioam6_enabled = 0
net.ipv6.conf.enp87s0.ioam6_id = 65535
net.ipv6.conf.enp87s0.ioam6_id_wide = 4294967295
net.ipv6.conf.enp87s0.keep_addr_on_down = 0
net.ipv6.conf.enp87s0.max_addresses = 16
net.ipv6.conf.enp87s0.max_desync_factor = 600
net.ipv6.conf.enp87s0.mc_forwarding = 0
net.ipv6.conf.enp87s0.mldv1_unsolicited_report_interval = 10000
net.ipv6.conf.enp87s0.mldv2_unsolicited_report_interval = 1000
net.ipv6.conf.enp87s0.mtu = 1500
net.ipv6.conf.enp87s0.ndisc_evict_nocarrier = 1
net.ipv6.conf.enp87s0.ndisc_notify = 0
net.ipv6.conf.enp87s0.ndisc_tclass = 0
net.ipv6.conf.enp87s0.proxy_ndp = 0
net.ipv6.conf.enp87s0.ra_defrtr_metric = 1024
net.ipv6.conf.enp87s0.regen_max_retry = 3
net.ipv6.conf.enp87s0.router_probe_interval = 60
net.ipv6.conf.enp87s0.router_solicitation_delay = 1
net.ipv6.conf.enp87s0.router_solicitation_interval = 4
net.ipv6.conf.enp87s0.router_solicitation_max_interval = 3600
net.ipv6.conf.enp87s0.router_solicitations = -1
net.ipv6.conf.enp87s0.rpl_seg_enabled = 0
net.ipv6.conf.enp87s0.seg6_enabled = 0
net.ipv6.conf.enp87s0.seg6_require_hmac = 0
net.ipv6.conf.enp87s0.suppress_frag_ndisc = 1
net.ipv6.conf.enp87s0.temp_prefered_lft = 86400
net.ipv6.conf.enp87s0.temp_valid_lft = 604800
net.ipv6.conf.enp87s0.use_oif_addrs_only = 0
net.ipv6.conf.enp87s0.use_tempaddr = 0
net.ipv6.conf.all.accept_dad = 0
net.ipv6.conf.all.accept_ra = 1
net.ipv6.conf.all.accept_ra_defrtr = 1
net.ipv6.conf.all.accept_ra_from_local = 0
net.ipv6.conf.all.accept_ra_min_hop_limit = 1
net.ipv6.conf.all.accept_ra_mtu = 1
net.ipv6.conf.all.accept_ra_pinfo = 1
net.ipv6.conf.all.accept_ra_rt_info_max_plen = 0
net.ipv6.conf.all.accept_ra_rt_info_min_plen = 0
net.ipv6.conf.all.accept_ra_rtr_pref = 1
net.ipv6.conf.all.accept_redirects = 1
net.ipv6.conf.all.accept_source_route = 0
net.ipv6.conf.all.accept_untracked_na = 0
net.ipv6.conf.all.addr_gen_mode = 0
net.ipv6.conf.all.autoconf = 1
net.ipv6.conf.all.dad_transmits = 1
net.ipv6.conf.all.disable_ipv6 = 0
net.ipv6.conf.all.disable_policy = 0
net.ipv6.conf.all.drop_unicast_in_l2_multicast = 0
net.ipv6.conf.all.drop_unsolicited_na = 0
net.ipv6.conf.all.enhanced_dad = 1
net.ipv6.conf.all.force_mld_version = 0
net.ipv6.conf.all.force_tllao = 0
net.ipv6.conf.all.forwarding = 0
net.ipv6.conf.all.hop_limit = 64
net.ipv6.conf.all.ignore_routes_with_linkdown = 0
net.ipv6.conf.all.ioam6_enabled = 0
net.ipv6.conf.all.ioam6_id = 65535
net.ipv6.conf.all.ioam6_id_wide = 4294967295
net.ipv6.conf.all.keep_addr_on_down = 0
net.ipv6.conf.all.max_addresses = 16
net.ipv6.conf.all.max_desync_factor = 600
net.ipv6.conf.all.mc_forwarding = 0
net.ipv6.conf.all.mldv1_unsolicited_report_interval = 10000
net.ipv6.conf.all.mldv2_unsolicited_report_interval = 1000
net.ipv6.conf.all.mtu = 1280
net.ipv6.conf.all.ndisc_evict_nocarrier = 1
net.ipv6.conf.all.ndisc_notify = 0
net.ipv6.conf.all.ndisc_tclass = 0
net.ipv6.conf.all.proxy_ndp = 0
net.ipv6.conf.all.ra_defrtr_metric = 1024
net.ipv6.conf.all.regen_max_retry = 3
net.ipv6.conf.all.router_probe_interval = 60
net.ipv6.conf.all.router_solicitation_delay = 1
net.ipv6.conf.all.router_solicitation_interval = 4
net.ipv6.conf.all.router_solicitation_max_interval = 3600
net.ipv6.conf.all.router_solicitations = -1
net.ipv6.conf.all.rpl_seg_enabled = 0
net.ipv6.conf.all.seg6_enabled = 0
net.ipv6.conf.all.seg6_require_hmac = 0
net.ipv6.conf.all.suppress_frag_ndisc = 1
net.ipv6.conf.all.temp_prefered_lft = 86400
net.ipv6.conf.all.temp_valid_lft = 604800
net.ipv6.conf.all.use_oif_addrs_only = 0
net.ipv6.conf.all.use_tempaddr = 0
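A dump this long is easier to scan with a small script. The sketch below assumes the output has been saved as plain `key = value` text and lists the interfaces that have `disable_ipv6 = 1`. In the dump above that is the physical NIC enp87s0; whether that matters for a NIC that is only a bridge port is not established here. The helper names `parse_sysctl` and `ipv6_disabled_interfaces` are made up for illustration.

```python
def parse_sysctl(text: str) -> dict[str, str]:
    """Parse `key = value` lines as printed by sysctl."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' (blank lines, stray text)
            settings[key.strip()] = value.strip()
    return settings


def ipv6_disabled_interfaces(settings: dict[str, str]) -> list[str]:
    """Return interfaces whose net.ipv6.conf.<iface>.disable_ipv6 is 1."""
    prefix, suffix = "net.ipv6.conf.", ".disable_ipv6"
    return sorted(
        key[len(prefix):-len(suffix)]
        for key, value in settings.items()
        if key.startswith(prefix) and key.endswith(suffix) and value == "1"
    )


# A few lines copied from the dump above.
sample = """\
net.ipv6.conf.enp87s0.disable_ipv6 = 1
net.ipv6.conf.enp87s0.mtu = 1500
net.ipv6.conf.all.disable_ipv6 = 0
net.ipv6.conf.all.mtu = 1280
"""
print(ipv6_disabled_interfaces(parse_sysctl(sample)))  # ['enp87s0']
```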

FWIW I experience the same issue, but with a different setup.

  • HAOS 2023.11.3 on a Raspberry Pi 4 host
    • Funnily enough I am actively working on migrating to a Proxmox cluster
  • NetGate 6100 router/firewall
    • Avahi plug-in installed
    • mDNS is enabled on pfSense
    • Followed instructions from this thread to do that: Thread/Matter, Router Rules, and Firewalls
    • However things randomly worked before putting these changes in place too!
  • Using SkyConnect for Zigbee and as my only Thread border router
  • Few different subnets, HAOS on the management VLAN
  • Have IPv6 enabled internally, but this also all worked a few times without it
  • TP-Link Omada setup pretty much across the board (minus the Netgate obviously)
    • Don’t take this as an endorsement. The first company to deliver a large 2.5G managed switch with better software will take my money and these will go on sale!

I see the same CHIP Error 0x00000032 timeouts in my logs when pairing fails. But I do not have as much luck getting it to pair just by giving things a reboot.

It often fails once it gets to the internet connectivity test, but I have had it get past that and onto the pairing with Home Assistant step and fail. No useful information on the screen, just “something went wrong”.

I solved my issue. The problem was bad switch firmware on my QNAP switch. I dumped the QNAP and replaced it with quality switches that support MLD snooping. One switch is a TP-Link TL-SG3428X (NOT using Omada since it has neutered MLD support…must use the switch in standalone mode). The other two switches are Netgear M4300-16X. These switches are enterprise grade with full MLD support as well. But to be fair, the mDNS issues 100% went away with just the TP-Link Jetstream switch.
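For anyone wanting to check the host side of this: a switch doing MLD snooping only forwards a multicast group to ports that have reported membership in it, so it can help to confirm that a Linux machine running an mDNS responder has actually joined the IPv6 mDNS group ff02::fb. Here is a rough sketch, assuming the usual `/proc/net/igmp6` layout (interface index, interface name, group address as 32 hex digits, then counters). The function names are made up for illustration, and the check has to run on the machine doing mDNS (for example inside the VM), not necessarily on the Proxmox host.

```python
import ipaddress


def parse_igmp6(text: str) -> dict[str, set[str]]:
    """Map interface name -> set of joined IPv6 multicast groups."""
    groups: dict[str, set[str]] = {}
    for line in text.splitlines():
        fields = line.split()
        # Expect: index, interface, 32-hex-digit address, then counters.
        if len(fields) < 3 or len(fields[2]) != 32:
            continue
        iface, hex_addr = fields[1], fields[2]
        addr = ipaddress.IPv6Address(bytes.fromhex(hex_addr))
        groups.setdefault(iface, set()).add(str(addr))
    return groups


def has_mdns_group(groups: dict[str, set[str]], iface: str) -> bool:
    """True if the interface has joined the IPv6 mDNS group ff02::fb."""
    return "ff02::fb" in groups.get(iface, set())


def joined_groups(path: str = "/proc/net/igmp6") -> dict[str, set[str]]:
    """Read and parse the kernel's list of joined IPv6 multicast groups."""
    with open(path) as handle:
        return parse_igmp6(handle.read())
```

If `has_mdns_group(joined_groups(), "<iface>")` is false on the machine that should be answering mDNS, a snooping switch would have no reason to send it that traffic.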

I wrote a detailed blog post on Thread/Matter, with a bunch of troubleshooting tips in part 3:

Definitely not going to use Apple or Google or whatever for this, but thanks I guess.

Your ONLY working choice is Apple or Google. Gotta pick one Thread border router! Per my article, the HA Thread border router is only in early dev stages with zero UI and no configuration options. You need to use a working Thread border router. Maybe in a year the HA OTBR will be production ready, but it’s not even close now. So it’s pointless to use it right now.

This guide is written by the HA Matter lead developer (Marcel):

These are the recommended/supported scenarios for Matter at this time:

Use the HA Companion app on iPhone or Android to commission a Wi-Fi based Matter device (using the phone to do the Bluetooth commissioning to the device).

Use the HA Companion app on iPhone to commission a Thread based Matter device utilizing existing Apple Border router(s) like HomePod or ATV 4K.

Use the HA Companion app on Android to commission a Thread based Matter device utilizing existing Google Border router(s) like Nest Hub V2 or Nest WiFi Pro.

Yeah I don’t really want Google or Apple home shit if at all avoidable. But I understand them being recommended for the sake of general consumption.

I ended up getting it working with a SkyConnect.

Main thing was to NOT use the Add Device through the HA companion app under the Matter integration (like I do for ZHA), it would fail like 29 attempts out of 30. Instead let Bluetooth pick it up locally on an Android phone and add it to the HA companion at the end, which only shits the bed maybe once or twice before working. :)

A recent SkyConnect multi-protocol update (2.4.2) basically completely broke ZHA for me and I had to roll it back (to 2.3.2), so it certainly needs more time in the oven. But I’m here for it.


I’m having all kinds of issues and just came across your guide. So sorry to bump an old post. :)

I do have an unmanaged switch that sits between my network AP and both my HA server and my primary HomeKit hub. Wondering if that might be the thing causing some issues.

I noticed you recommended a TP-Link switch that is managed and has 4 10GE SFP+ slots. As a non-network guy…I don’t know what this means. I am on a 500 Mbps fiber plan, so would these slots even help me? The option with just 4 SFP slots is quite a bit cheaper.

Also I am using TP-Link Deco routers for my network and wondering if there is a switch comparable to the one in question that could be managed directly in the Deco app without the need to also get the Omada app. But I can investigate on that part.

Finally, my HAOS is running within Proxmox so I will review that portion of your guide next. I typically use Apple to commission my Matter devices first, though, then use codes from Apple to push them to other platforms.

I have been running a smart home for years but am a newb when it comes to networking so any advice will be appreciated!

EDIT: I have been playing with Matter in HA, HomeKit, SmartThings, and the Aqara app and all 4 hubs sit behind this unmanaged switch. You have given me a lot to think about, my friend.