End device randomly stops reporting despite having good coverage

I have a battery-powered device that reports energy usage (metering->instantaneous demand/current_consumption_delivered), and its reporting is unstable. It can report fine for hours, then suddenly stop for hours, then come back, and the cycle repeats. I cannot see any pattern in when the reporting works and when it does not.

The interesting thing is that in the periods where it stopped reporting, I can still send requests to the end device and it will answer a few seconds later. So it is alive and in the network. I send the request basic->manufacturer which is always answered by the device.

This however doesn’t start the reporting again.

But as soon as I send a request for metering->instantaneous demand/current_consumption_delivered, the reporting starts working again for some hours.

What could make the device act like that?

The coverage is good, I think. The network visualization shows 135 to the nearest router (but the LQI on the device info page can report anything from 10 to 100; I am not sure how to interpret this), and the router is reporting constantly, i.e. it is not offline.

I guess the device awaits an acknowledgement to the reporting and will stop reporting if too many acknowledgements are lost. I seem to remember, having worked with this device in another context, that it will stop reporting if acknowledgements aren’t received.

Are acknowledgements retried if the device does not receive them on the first attempt?

I would like to log all messages to/from this device, but I don’t know if that is possible. Logging all of zigpy/HA for days will produce gigabytes of log and be next to impossible to look through, especially when I don’t know what exactly I am looking for.
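One way to make a huge log manageable might be to filter it after the fact, keeping only lines that mention this device and then listing the long silences between them. The sketch below assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp and contains the device's address somewhere in the text; the addresses in `DEVICE_MARKERS` and the function names are made up for illustration and would need to be replaced with real values.

```python
import re
from datetime import datetime

# Hypothetical identifiers for the device (IEEE and network address).
# Replace these with the values shown on the device info page.
DEVICE_MARKERS = ("00:15:bc:00:11:22:33:44", "0x1a2b")

# Assumed log format: the line begins with "YYYY-MM-DD HH:MM:SS".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def filter_device_lines(lines, markers=DEVICE_MARKERS):
    """Yield only the lines that mention one of the markers (case-insensitive)."""
    lowered = tuple(m.lower() for m in markers)
    for line in lines:
        text = line.lower()
        if any(m in text for m in lowered):
            yield line.rstrip("\n")


def find_gaps(lines, min_gap_seconds=600):
    """Return (start, end) pairs where consecutive lines are far apart in time."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip continuation lines without a timestamp
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if previous is not None:
            if (current - previous).total_seconds() >= min_gap_seconds:
                gaps.append((previous, current))
        previous = current
    return gaps
```

Run over a log file opened with `open(path, encoding="utf-8")`, `find_gaps(list(filter_device_lines(f)))` would give the periods in which nothing was logged for the device, which narrows down where to look in the full log.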

The device in question is:

ZHEMI - ZigBee External Meter Interface
by Develco Products A/S
Connected via Texas Instruments CC2652

How many router devices do you have? LQI down to 10 can indicate a poor mesh. I would try adding a few routers.

I have 10 routers. The nearest is about 1 meter away. I don’t understand why the LQI can vary this much…

Also, what are the numbers on the network visualization for?

I can even reconfigure the device without the reporting starting up again. I must query the specific metering registers for reporting to start again. And again, communication works every time. Something makes the device stop, and my only guess is missing or failed acknowledgements.
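As a stopgap, since reading the metering attribute is what restarts the reporting, a small watchdog could trigger that read whenever reports have been quiet for too long. This is only a sketch of the idea: `ReportWatchdog` and the `poll` callback are hypothetical names, and how the attribute is actually read (and how incoming reports are hooked up) is left to the caller.

```python
import time


class ReportWatchdog:
    """Call `poll` when no report has been seen for `timeout_seconds`."""

    def __init__(self, timeout_seconds, poll, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._poll = poll      # hypothetical callback that reads the attribute
        self._clock = clock    # injectable so the logic can be tested
        self._last_report = clock()

    def report_received(self):
        """Record that the device just sent a report."""
        self._last_report = self._clock()

    def check(self):
        """Poll the device if it has been silent too long; return True if polled."""
        if self._clock() - self._last_report < self.timeout_seconds:
            return False
        self._poll()
        # Restart the timer so repeated checks do not poll every time.
        self._last_report = self._clock()
        return True
```

Calling `check()` periodically (for example once a minute) and `report_received()` on every incoming report would at least shorten the outages, although it does not explain why the device stops in the first place.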

Not clear whether you’re using ZHA or Z2M. If the former, have you looked at the ZHA Toolkit? There may be something helpful there.

Again assuming ZHA, the network visualisation shows all the routes that Zigbee has discovered, not just the ones in use. It’s a snapshot, and the numbers are an average LQI based on recent messages. On links between routers there are two numbers, one reported by each. If you’re troubleshooting a particular device, the LQI on the device info page is more useful.

LQI is a measure of error rates. An LQI of 255 means that no messages have been lost between two points. An LQI of 10 means that nearly everything is being lost. The LQI will change because in a situation like that the routers involved will constantly be searching for a better connection - sometimes finding one, sometimes not.

The whole network is doing this, with connections being shuffled all the time. A strong mesh is one where there are enough possible routes for the routers always to find strong links with their neighbours. In a weak network they have to compete for connections and some devices sometimes lose out.

Ten routers is not many. The number you need will depend on the layout and structure of your house, but distance is not the issue, it’s the number of connections available. In my house - small with thick walls - I have three or four in every room; nine in one room.

The best strategy is to keep adding routers, one by one until it works :grin:. Let the network settle for a day or two after each addition.