HomeKit Accessory Protocol (HAP) over CoAP/UDP (was: Nanoleaf Essentials bulb via Thread/CoAP)

Anyone else following along running into a similar thing, I finally got this device working today. I am 95% sure it’s due to having an 802.11ad bonded interface. When I set net.ipv6.conf.br0.accept_ra=2, I also needed to set net.ipv6.conf.bond0.accept_ra=2

1 Like

I’m experiencing similar behavior. Homekit Controller Thread accessories are pairable and work very well, but will sporadically become unavailable and require some combination of Reloading the Integration, restarting Home Assistant, power cycling all homepod mini’s (thread border routers) and as a last resort a factory reset + repairing of the device itself.

I am seriously considering ripping every thread accessory out of Home Assistant and directly pairing with Homekit again until these random disconnects can be sorted out because it’s making these devices useless half the time

1 Like

@setomerza I’m having trouble reproducing this locally because I have a very small number of devices. Do you feel comfortable modifying a few Python files to help debug and to try a few solutions?

Can you please try making the following change:

  1. get to the HA console (depends on how you have it deployed: ssh, serial console, …)
  2. get to the core container: docker exec -it homeassistant /bin/sh
  3. go to the aiohomekit CoAP folder: cd /usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap
  4. open the connection code: vi connection.py
  5. search for the function: /def is_connected
  6. arrow down one line (to the return statement) and press A to append
  7. add a space and paste and self.enc_ctx.send_ctr - self.enc_ctx.recv_ctr < 10
  8. save changes by pressing Esc and then ZZ
  9. restart HA through the web interface
  10. do NOT update HA for a few days as that will wipe out the changes
  11. please report back if it helps or not :slight_smile:

Some background: that shouldn’t let an accessory go more than 10 minutes of no communication as HA will try to poll device state once a minute. We will likely be able to make 10 smaller, I’m just starting with a bit of margin. If it improves things, please feel free to try down to as small as 2 or 3.

I know in Thread v1.3 border routers are supposed to play nice with each other, but how does this work in practice (or does it at this point since Thread v1.3 is relatively new). E.g. Let’s say I have a HomePod, and then add an Echo that supports Thread. Obviously the Echo can used mDNS to discover the HomePod’s Thread network, but how do they “link?” Presumably there authentication keys that need to be exchanged, or some sort of user action that is needed to say “join this existing network?”

I currently have 16 thread devices connected via 3 Homepod Minis to Homekit controller (mix of Nanoleaf Essential A19 bulbs, Eve Energy, Eve Door Sensors, and an Eve Weather). Initially getting everything set up and paired into Home Assistant was a breeze and everything stayed online without issue for a few days, until devices started suddenly falling off the network.

It already seems like some type of device polling/reconnecting is going on now because every minute or so the red icons will turn grey as the device tries to re-initialize and fails.

Please see the attached info below and let me know if you still would like me to try setting up the 10 minute polling

Here are some relevant Logs that might be helpful.

Logger: aiohomekit.controller.coap.connection
Source: /usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py:391 
First occurred: November 5, 2022 at 1:35:03 PM (29063 occurrences) 
Last logged: 1:28:05 PM

Pair verify failed
OSError: [Errno 113] received through errqueue

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 386, in connect
    await self.do_pair_verify(pairing_data)
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 345, in do_pair_verify
    response = await asyncio.wait_for(
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/site-packages/aiocoap/protocol.py", line 597, in _run_outer
    await cls._run(app_request, response, weak_observation, protocol, log)
  File "/usr/local/lib/python3.10/site-packages/aiocoap/protocol.py", line 656, in _run
    blockresponse = await blockrequest.response
aiocoap.error.NetworkError: [Errno 113] received through errqueue
Logger: coap-server
Source: runner.py:119 
First occurred: November 5, 2022 at 1:35:01 PM (30522 occurrences) 
Last logged: 1:28:05 PM

Error received and ignored in this codepath: [Errno 113] Host is unreachable
Received Type.ACK from <UDP6EndpointAddress [fd81:d68c:e2c2:0:d9b5:6ee7:9f9b:bca7] (locally fd45:7a0a:5395:42f1:11b5:f444:ec4e:8451%eth0)>, but could not match it to a running exchange.
Received Type.ACK from <UDP6EndpointAddress [fd81:d68c:e2c2:0:247f:a7fd:8eab:4158] (locally fd45:7a0a:5395:42f1:11b5:f444:ec4e:8451%eth0)>, but could not match it to a running exchange.
Received Type.ACK from <UDP6EndpointAddress [fd0c:29f0:2531:0:e499:104d:1488:274a] (locally fd45:7a0a:5395:42f1:11b5:f444:ec4e:8451%eth0)>, but could not match it to a running exchange.
Logger: homeassistant.config_entries
Source: config_entries.py:1088 
First occurred: November 5, 2022 at 1:35:41 PM (2333 occurrences) 
Last logged: 1:27:57 PM

Config entry 'Entrance Closet Bulb' for homekit_controller integration not ready yet: failed to connect; Retrying in background
Config entry 'Crawl Space Front Light' for homekit_controller integration not ready yet: failed to connect; Retrying in background
Config entry 'Office Closet Door' for homekit_controller integration not ready yet: failed to connect; Retrying in background
Config entry 'Alarm Siren Outlet' for homekit_controller integration not ready yet: failed to connect; Retrying in background
Config entry 'Garage Closet Bulb' for homekit_controller integration not ready yet: failed to connect; Retrying in background
Logger: aiohttp.server
Source: /usr/local/lib/python3.10/site-packages/aiohttp/web_protocol.py:405 
First occurred: November 5, 2022 at 2:00:08 PM (3 occurrences) 
Last logged: 1:21:45 PM

Unhandled exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1191, in _sendfile_fallback
    read = await self.run_in_executor(None, file.readinto, view)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/aiohttp/web_protocol.py", line 514, in start
    resp, reset = await task
  File "/usr/local/lib/python3.10/site-packages/aiohttp/web_protocol.py", line 460, in _handle_request
    reset = await self.finish_response(request, resp, start_time)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/web_protocol.py", line 613, in finish_response
    await prepare_meth(request)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/web_fileresponse.py", line 286, in prepare
    return await self._sendfile(request, fobj, offset, count)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/web_fileresponse.py", line 99, in _sendfile
    await loop.sendfile(transport, fobj, offset, count)
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1170, in sendfile
    return await self._sendfile_fallback(transport, file,
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1200, in _sendfile_fallback
    await proto.restore()
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 268, in restore
    self._transport.resume_reading()
  File "/usr/local/lib/python3.10/asyncio/sslproto.py", line 343, in resume_reading
    self._ssl_protocol._transport.resume_reading()
AttributeError: 'NoneType' object has no attribute 'resume_reading'

@setomerza

Oof. The first error (Pair verify failed) indicates initial or reconnect communication failure. The second (Received Type.ACK) seems to be harmless noise based on many similar reports. The third I’m guessing is from restarting HA or reloading the integration? Likely caused by the first error. The fourth isn’t related to Thread/CoAP AFAICT.

The issue I’m hoping to fix with the one-line patch above is where communication drops & neither the Border Router nor the device return any error condition. All we have are timeouts. Thinking about this more, you could start with a smaller number than 10, maybe even 1. And yes, please do try the patch.

If there are continued pair verify errors, we’ll dig into that next.

@lambdafunction Okay I appended the patch and turned it down to 1 minute, restarted HA and let everything settle for a while and I’m still seeing the same behavior with the same half of devices unable to connect and the “Pair verify failed” error repeating in logs

Logger: aiohomekit.controller.coap.connection
Source: /usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py:391
First occurred: 2:18:41 PM (83 occurrences)
Last logged: 2:23:23 PM

Pair verify failed
OSError: [Errno 113] received through errqueue

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 386, in connect
    await self.do_pair_verify(pairing_data)
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 345, in do_pair_verify
    response = await asyncio.wait_for(
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/site-packages/aiocoap/protocol.py", line 597, in _run_outer
    await cls._run(app_request, response, weak_observation, protocol, log)
  File "/usr/local/lib/python3.10/site-packages/aiocoap/protocol.py", line 656, in _run
    blockresponse = await blockrequest.response
aiocoap.error.NetworkError: [Errno 113] received through errqueue

Interesting… if it is the exact same set of devices each time, that makes me think those devices have some sort of packet loss or other issue. Can you please pick one of the devices that is having trouble and ping6 it while HA is running? Leave the ping running for over a minute to see if there is any packet loss, either regularly or during the HA state poll. You can use mDNS tools to get the hostname or pick a misbehaving NL bulb which should have a pattern like Nanoleaf-A19-XXXX.local. You could also try disabling a few HAP+Thread devices and see if the other ones improve.

This is slightly off-topic, but I’ll answer here.

The process of adding a Thread node, including any router or border router, to a Thread network is called commissioning. In order to commission a node, you need to use a commissioning agent. This is typically going to be a smartphone platform function (e.g. something provided by Android or iOS itself), or potentially a smartphone app (in the case of vendors that do not own a smartphone platform). The bottom line of a commissioning agent is something that knows a set of network (e.g. PAN ID) and cryptographic parameters to allow the node to join a particular Thread network. The node is usually configured through some side-band channel (e.g. Bluetooth, a Wifi network broadcast by the device, a serial port, etc).

The Connectivity Standards Alliance (CSA) members have agreed to build methods to exchange these parameters in order to “merge” their up-to-now disparate Thread networks. For example, Apple added the THCredentials API. Note that this effectively has nothing to do with Thread or Matter per-se.

3 Likes

I have 2 Nanoleaf Essentials A19 bulbs, the first 1 paired using thread smoothly, I spent hours trying to pair the 2nd one but I always got the below error. I finally unscrewed the 1st (paired) bulb restarted HA and paired the 2nd one and it worked. Screwed back the 1st one and the same error is back, tried cycling power for the bulbs, homepods, HA, nothing works, consistently the same error, one of the bulbs works (whichever connects first) and the 2nd throws the below error:

Pair verify failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 386, in connect
    await self.do_pair_verify(pairing_data)
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 334, in do_pair_verify
    coap_client = await Context.create_server_context(root, bind=("::", 0))
  File "/usr/local/lib/python3.10/site-packages/aiocoap/protocol.py", line 287, in create_server_context
    await self._append_tokenmanaged_transport(
  File "/usr/local/lib/python3.10/site-packages/aiocoap/protocol.py", line 156, in _append_tokenmanaged_transport
    transport = await token_interface_constructor(tman)
  File "/usr/local/lib/python3.10/site-packages/aiocoap/transports/ws.py", line 191, in create_transport
    server = await websockets.serve(
  File "/usr/local/lib/python3.10/site-packages/websockets/legacy/server.py", line 1117, in __await_impl__
    server = await self._create_server()
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1505, in create_server
    raise OSError(err.errno, 'error while attempting '
OSError: [Errno 98] error while attempting to bind on address ('::', 3000, 0, 0): address in use

Any idea how can I fix this?

@06_ripcord.ogress I think that error was fixed in a recent HA release, 2022.9.X or 2022.10.X (aiocoap bug). What version are you on? If you are on a recent version, we’ll need to figure out what version of the aiocoap library you are using.

@lambdafunction Thanks for responding, I’m on the latest version 2022.11.x

@06_ripcord.ogress can you please sanity check a few things for me:

grep ^Version /usr/local/lib/python3.10/site-packages/aiocoap-*/METADATA

grep -n -C 3 3000 /usr/local/lib/python3.10/site-packages/aiocoap/transports/ws.py

Are you using HA OS?

I didn’t implement your fix but just wanted to add that I experience the same problem. I’ve added 6 Thread Eve Thermos (which worked without any problems) and they all work for a few days. Then suddenly some or all of them aren’t available anymore in homeassistant. It’s not always the same device. Sometimes rebooting Home Assistant or the Home Pod mini/ Apple TV fixes it for a few hours/ some thermostat or even a few days but that’s not reliable. I didn’t get all of them to reconnect today - as of now only one device is reachable.

I tried to ping the devices and they are pingable (at least 5/6 are - although I can only control 1/6 at the moment)

Logger: aiohomekit.controller.coap.connection
Source: /usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py:391 
First occurred: 18:03:44 (90 occurrences) 
Last logged: 18:14:46

Pair verify failed
OSError: [Errno 113] received through errqueue

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 386, in connect
    await self.do_pair_verify(pairing_data)
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 345, in do_pair_verify
    response = await asyncio.wait_for(
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/site-packages/aiocoap/protocol.py", line 597, in _run_outer
    await cls._run(app_request, response, weak_observation, protocol, log)
  File "/usr/local/lib/python3.10/site-packages/aiocoap/protocol.py", line 656, in _run
    blockresponse = await blockrequest.response
aiocoap.error.NetworkError: [Errno 113] received through errqueue

@DaJonas @setomerza can both of you try a quick change, let’s extend the amount of time pair verify has to finish. Please edit /usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py around line 347. You can search for do_pair_verify and then down ~19 lines should be the wait_for call with timeout=8.0. Can you please change that timeout to 45.0?

Background: aiocoap retries unanswered packets automatically with an exponential backoff (1s, 2s, 4s, 8s, …). Perhaps we aren’t giving busy/lossy networks enough time…

@lambdafunction

@06_ripcord.ogress that’s really weird… you should be on aiocoap 0.4.4. Let me look around, maybe some other component is pulling in 0.4.1.

EDIT: if you want to fix this now, change the else: line on 187 of ws.py to elif port != 0: being careful to preserve the spacing and restart HA.

@setomerza @DaJonas it is possible you are having the same issue if you are on 2022.11.x. Can you please check your aiocoap version? And please make this change. It is a fix I pushed into aiocoap months ago to solve this problem. I’m not sure why it has downgraded.

I am up to 8 thread homekit devices (though 7 in active use). Mixture of Eve, Nanoleaf and Wemo.

The biggest source of outages for me has been my HomePod hopping between wifi networks (moved it to my iot vlan so the RA’s were handled properly, but it remembers the old network and keeps hopping back).

From errors I’ve seen on GitHub, this one:

aiocoap.error.NetworkError: [Errno 113] received through errqueue

has usually been connectivity issues (dodgy wifi repeaters, RA’s not been handled properly, surprise border router on a different vlan, etc).

(EDIT: 113 is an OS error code for “no route to host” which is generally what I’ve seen)

1 Like

As I was beginning to debug this and follow your steps I rebooted home assistant OS (core 2022.11.X) to set up root SSH authorized_keys, and sure enough when I checked homekit controller all the devices were back on the network working properly now… I still have the

and self.enc_ctx.send_ctr - self.enc_ctx.recv_ctr < 1

patch applied but this was the first time I observed any difference in behavior over the past 2 days.

I also ran these commands to verify:

grep ^Version /usr/local/lib/python3.10/site-packages/aiocoap-*/METADATA

Returns:

Version: 0.4.4

And this

grep -n -C 3 3000 /usr/local/lib/python3.10/site-packages/aiocoap/transports/ws.py

Returns:


3 3000 /usr/local/lib/python3.10/site-packages/aiocoap/transports/ws.py
50-
51-  Due to a shortcoming of aiocoap's way of specifying ports to bind
52-  to, if a port is explicitly stated to bind to, CoAP-over-WS will bind to th                                       at
53:  port plus 3000 (resulting in the abovementioned 8683 for 5683). If TLS serv                                       er
54-  keys are given, the TLS server is launched on the next port after the HTTP
55-  server (typically 8684).
56-"""
--
211-                port = 8683
212-            elif port != 0:
213-                # FIXME see module documentation
214:                port = port + 3000
215-
216-            server = await websockets.serve(
217-                    functools.partial(self._new_connection, scheme='coap+ws'                                       ),
/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap #

We’ll see how long this behavior lasts for. Fingers crossed!

My aiocap Version is 0.4.4 as well.

I have just implemented your fix with the “timeout=45.0” (which was harder than you might think without any knowledge about how to access it). Just did a reboot, verified the timeout is still 45.0. At least at the moment the devices are still unavailable, but pingable. (only checked that with one device though but at least that device should be available then)