Zha ping monitor

newmail · April 22, 2023, 12:02am

Hello everyone,

This is all AGPLv3. As the license also says, use all the information, code and YAML in this post at your own risk.

I’d like to be notified as soon as possible when a mains-powered Zigbee device goes offline. I have no idea how reliable these lights and switches will be in the long term, so if a device stops responding because it is defective, I prefer to be notified promptly, just in case.

There are probably other ways to achieve this. I suppose it would be possible to lower the ZHA timeout for non-battery devices from 7200 to <=300 and then listen for unavailable states, but I guess it’s set to 7200 for a reason, or not? It’s also not clear how many times ZHA is guaranteed to try to ping during those 300 sec in case of temporary packet loss due to interference. Logic that can be tweaked as needed to cope with packet loss, and that doesn’t depend on the ZHA timeout internals, seemed more robust and future-proof to me.

If somebody can find a way to turn this into a blueprint and make it more self-contained, that would be great. Another direction would be to write it all in Python, which would make it much simpler: a single self-contained script that could hold the state within python_script using time.sleep (and wouldn’t risk overflowing), if only there were a way to listen to the events. But then I read that time.sleep would slow down the core, so maybe it’s better not to use time.sleep anyway, dunno.

The end result of all the above constraints is that the procedure to install this automation is very manual with one helper, two automations, two scripts, plus one external component:

install zha toolkit
create an input_text helper, preferably with its maximum length set to 255, the largest size available (why so short…). I called it input_text.zha_toolkit_ping_alarm_helper
enable python_script: and zha_toolkit: in configuration.yaml
install the Uptime sensor integration
paste the below in ~/config/python_scripts/zha_toolkit_ping_alarm_send.py

entity_ids = data.get('entity_ids')
max_tries = data.get('max_tries')
helper = data.get('helper')

if not helper or hass.states.get(helper) is None:
    logger.warning('Missing helper')
elif max_tries is None:
    logger.warning('Missing max tries')
elif max_tries < 1:
    logger.warning(f'Wrong max tries {max_tries}')
elif entity_ids is None:
    logger.warning('Missing entity_ids')
else:
    for entity_id in set(entity_ids):
        if hass.states.get(entity_id) is None:
            logger.warning(f'Not found entity_id: {entity_id}')
            continue
        service_data= {
                'ieee': entity_id,
                'event_done': 'zha_toolkit_ping_alarm',
                'args' : [ 1, max_tries ],
                }
        hass.services.call("zha_toolkit", "ieee_ping", service_data, blocking=False)

paste the below in ~/config/python_scripts/zha_toolkit_ping_alarm_recv.py

ieee_org = data.get('ieee_org')
success = data.get('success')
tries = data.get('tries')
max_tries = data.get('max_tries')
helper = data.get('helper')

if not helper or hass.states.get(helper) is None:
    logger.warning('Missing helper')
elif max_tries is None:
    logger.warning('Missing max tries')
elif max_tries < 1:
    logger.warning(f'Wrong max tries {max_tries}')
elif tries is None:
    logger.warning('Missing tries')
elif tries < 1 or tries > max_tries:
    logger.warning(f'Wrong tries {tries}')
elif ieee_org is None or hass.states.get(ieee_org) is None:
    logger.warning('Missing ieee_org')
elif success is None or success not in (True, False):
    logger.warning('Missing success')
else:
    helper_state = hass.states.get(helper)
    assert(helper_state.attributes['editable'] == True)
    assert(helper_state.attributes['min'] == 0)
    assert(helper_state.attributes['pattern'] is None)
    assert(helper_state.attributes['mode'] == 'text')
    len_max = helper_state.attributes['max']

    last_offline = set(helper_state.state.split())
    helper_update = False
    for offline in last_offline.copy():
        if hass.states.get(offline) is None:
            logger.warning(f'Discarding {offline} from helper')
            last_offline.discard(offline)
            helper_update = True
    if success:
        if ieee_org in last_offline:
            last_offline.remove(ieee_org)
            helper_update = True
            logger.warning(f'Online {ieee_org}')
    else:
        if tries >= max_tries:
            friendly_name = hass.states.get(ieee_org).attributes['friendly_name']
            service_data= {
                    'title': 'Ping Alarm',
                    'message': f'Offline: {friendly_name}',
                    }
            hass.services.call("notify", "persistent_notification", service_data, blocking=False)
            logger.warning(f'Offline {ieee_org} {friendly_name}')
            last_offline.add(ieee_org)
            helper_update = True
        else:
            if ieee_org not in last_offline:
                service_data= {
                        'ieee': ieee_org,
                        'event_done': 'zha_toolkit_ping_alarm',
                        'args' : [ tries + 1, max_tries ],
                        }
                hass.services.call("zha_toolkit", "ieee_ping", service_data, blocking=False)

    if helper_update:
        new_offline = ' '.join(last_offline)
        if len(new_offline) > len_max:
            logger.warning('too long offline string, truncating')
            new_offline = new_offline[:len_max]
        hass.states.set(helper, new_offline, helper_state.attributes)
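
To make the retry logic of the recv script easier to follow, here is a minimal sketch of its decision step as a plain function that runs outside Home Assistant. The function name next_step and the action labels are hypothetical, made up for illustration; only the branching mirrors the script above.

```python
def next_step(success, tries, max_tries, ieee, offline):
    """Return (action, new_offline) for one ping result.

    action is "none", "retry" or "notify" (hypothetical labels);
    offline is the set of entity_ids currently stored in the helper.
    """
    offline = set(offline)           # work on a copy
    if success:
        offline.discard(ieee)        # device answered: forget any alarm
        return "none", offline
    if tries >= max_tries:
        offline.add(ieee)            # retries exhausted: raise the alarm
        return "notify", offline
    if ieee in offline:
        return "none", offline       # already reported: no further retries
    return "retry", offline          # ping again with tries + 1


# A device that fails three times in a row with max_tries=3.
offline = set()
actions = []
for tries in (1, 2, 3):
    action, offline = next_step(False, tries, 3, "light.abc", offline)
    actions.append(action)
print(actions, offline)
```

One thing this makes visible: a device already listed in the helper is not retried, so it is not reported again on every cycle unless max_tries is 1.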

add the below automation to ~/config/automations.yaml to define the interval of the ping (default 5 min) and replace light.abc and light.def with the list of entity_ids of the devices you need to monitor. You can also easily tweak the “max_tries” parameter if you prefer more or less tolerance for packet loss.

- alias: zha toolkit ping alarm send
  description: ''
  trigger:
  - platform: time_pattern
    minutes: /5
  condition:
  - condition: template
    value_template: '{{ as_timestamp(now()) - as_timestamp(states.sensor.uptime.last_changed)
      | int > 600 }}'
  action:
  - service: python_script.zha_toolkit_ping_alarm_send
    data:
      entity_ids:
      - light.abc  # edit this and add more or fewer entries as needed
      - light.def  # edit this and add more or fewer entries as needed
      helper: input_text.zha_toolkit_ping_alarm_helper
      tries: 0
      max_tries: 10
  mode: single

add the below automation to ~/config/automations.yaml to listen for the event that reports the pong or the timeout. The delay before each retry is also easy to tweak (5 sec by default). With more than 10 devices to ping you should increase max:, the limit on parallel runs.

- alias: zha toolkit ping alarm recv
  trigger:
  - platform: event
    event_type: zha_toolkit_ping_alarm
    event_data:
      command: ieee_ping
      params:
        event_done: zha_toolkit_ping_alarm
  condition: []
  action:
  - if:
    - condition: template
      value_template: '{{trigger.event.data.success == false}}'
    then:
    - delay:
        hours: 0
        minutes: 0
        seconds: 5
        milliseconds: 0
  - service: python_script.zha_toolkit_ping_alarm_recv
    data:
      ieee_org: '{{trigger.event.data.ieee_org}}'
      success: '{{trigger.event.data.success}}'
      tries: '{{trigger.event.data.params.args[0]}}'
      max_tries: '{{trigger.event.data.params.args[1]}}'
      helper: input_text.zha_toolkit_ping_alarm_helper
  mode: parallel
  max: 10
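
As a rough guide for tuning these numbers, the sketch below estimates how long it can take from a device failing to the Offline notification. The 5 min interval, 10 tries and 5 sec delay come from the automations above; ping_timeout_s is an assumption (I don’t know how long a single failed ieee_ping takes), so the 3 sec used here is only a placeholder.

```python
def alarm_delay(interval_s, max_tries, retry_delay_s, ping_timeout_s):
    """Return (delay_only_s, worst_case_s) for one failing device.

    The recv automation waits retry_delay_s after every failed ping,
    including the last one, before calling the recv script.
    """
    delay_only = max_tries * retry_delay_s
    # Worst case: the device dies right after a successful ping, so a full
    # interval passes before the first failing ping is even sent.
    worst_case = interval_s + max_tries * (ping_timeout_s + retry_delay_s)
    return delay_only, worst_case


# ping_timeout_s=3 is a hypothetical value, not taken from zha_toolkit.
print(alarm_delay(interval_s=300, max_tries=10, retry_delay_s=5, ping_timeout_s=3))
```

Each failing device also occupies one run of the recv automation while its delay is pending, which is why the max: value matters when many devices fail at once.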

you may want another automation that forwards call_service.persistent_notification with the given title to notify.notify, or you can edit the Python script directly to invoke any other notification service of your choice
you may want to decrease the amount of recording related to these events by tweaking the recorder: setting in configuration.yaml:

recorder:
  exclude:
    entities:
      - automation.zha_toolkit_ping_alarm_send
      - automation.zha_toolkit_ping_alarm_recv
    event_types:
      - zha_toolkit_ping_alarm

with too many devices going offline at once (more than 255 characters’ worth of entity_ids), the helper will overflow. This should only cause duplicate Offline notifications, but the overflow code path is untested
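
To illustrate the overflow case, here is a small sketch of what cutting the joined string does, next to one possible alternative that only keeps whole entity_ids. Both function names and the entity_ids are hypothetical, and the limit is set to 20 instead of 255 just to keep the example short.

```python
def join_truncated(entity_ids, len_max):
    """What the scripts do: join the ids, then cut the string at len_max."""
    return " ".join(entity_ids)[:len_max]


def join_whole_ids(entity_ids, len_max):
    """Alternative: stop before the first id that no longer fits."""
    kept, used = [], 0
    for entity_id in entity_ids:
        extra = len(entity_id) + (1 if kept else 0)  # +1 for the separator
        if used + extra > len_max:
            break
        kept.append(entity_id)
        used += extra
    return " ".join(kept)


ids = ["light.kitchen", "light.hallway", "switch.garage"]
print(join_truncated(ids, 20).split())  # last element is a cut-off fragment
print(join_whole_ids(ids, 20).split())
```

As far as I can tell from the code, a cut-off fragment is discarded on the next run because no entity with that name exists, and the devices that didn’t fit are simply not remembered, so they get reported again.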

newmail · April 27, 2023, 9:14pm

Here’s an alternative implementation that relies on the last_seen background ping and checks all devices with power_source == Mains. Checking every 5 min with a 5 min timeout means it may notify up to 10 min after the device really went offline.
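
The 10 min figure can be checked with a tiny simulation. This is only a sketch: the function name is made up, times are plain seconds, and last_seen is treated as the last moment the device was alive.

```python
CHECK_INTERVAL_S = 300   # time_pattern minutes: /5
TIMEOUT_S = 300          # timeout_seconds passed to the script


def first_alarm_time(last_seen_s, first_check_s):
    """Time of the first periodic check that considers the device offline."""
    t = first_check_s
    # Offline only if now - last_seen > timeout, so exactly 300 sec
    # is still treated as online.
    while t - last_seen_s <= TIMEOUT_S:
        t += CHECK_INTERVAL_S
    return t


# Device last seen at t=0, then silent.
print(first_alarm_time(0, 300))  # check lands exactly on the timeout
print(first_alarm_time(0, 1))    # check lands just after last_seen
```

So depending on where the checks fall relative to last_seen, the notification arrives somewhere between roughly 5 and 10 min after the device went silent.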

install zha_toolkit >= v0.8.39
add the below automation to ~/config/automations.yaml

- id: '1682291917918'
  alias: last seen monitor
  description: ''
  trigger:
  - platform: time_pattern
    id: time
    enabled: true
    minutes: /5
  - platform: event
    event_type: zha_toolkit_last_seen_monitor
    event_data:
      command: zha_devices
      success: true
      params:
        event_success: zha_toolkit_last_seen_monitor
    id: event
  condition:
  - condition: template
    value_template: '{{ as_timestamp(now()) - as_timestamp(states.sensor.uptime.last_changed)
      | int > 600 }}'
    enabled: true
  action:
  - choose:
    - conditions:
      - condition: trigger
        id: time
      sequence:
      - service: zha_toolkit.zha_devices
        data:
          event_success: zha_toolkit_last_seen_monitor
          command_data:
          - entities
          - power_source
          - last_seen
          - available
    - conditions:
      - condition: trigger
        id: event
      sequence:
      - service: python_script.last_seen_monitor
        data:
          devices: '{{ trigger.event.data.devices }}'
          helper: input_text.last_seen_monitor
          timeout_seconds: 300
  mode: queued
  max: 2

paste the below in ~/config/python_scripts/last_seen_monitor.py

entity_ids = data.get('entity_ids')
devices = data.get('devices')
helper = data.get('helper')
timeout_seconds = data.get('timeout_seconds')

title='Last Seen Monitor'

def error(text, raise_exception = True):
    text = 'failure: ' + text
    if helper_state.state != text:
        service_data= {
            'title': title,
            'message': text,
        }
        hass.services.call("notify", "persistent_notification", service_data, blocking=False)
        hass.states.set(helper, text, helper_state.attributes)
    if raise_exception:
        raise Exception(text)

assert(helper)
helper_state = hass.states.get(helper)
assert(helper_state.attributes['editable'] == True)
assert(helper_state.attributes['min'] == 0)
assert(helper_state.attributes['pattern'] is None)
assert(helper_state.attributes['mode'] == 'text')
len_max = helper_state.attributes['max']
assert(len_max >= 100)

if entity_ids:
    for entity_id in entity_ids:
        if not hass.states.get(entity_id):
            error('entity_id not found')

if timeout_seconds <= 0:
    error('invalid timeout_seconds')

try:
    helper_update = False
    last_offline = set(helper_state.state.split())

    def check_device(device, now, helper_update, entity_id):
        available = device['available']
        last_seen = device['last_seen']
        if not last_seen:
            available = False
        else:
            if now is None:
                now = dt_util.as_timestamp(dt_util.now())
            if now - dt_util.as_timestamp(last_seen) > timeout_seconds:
                available = False
        if not available and entity_id not in last_offline:
            friendly_name = hass.states.get(entity_id).attributes['friendly_name']
            service_data= {
                'title': title,
                'message': f'Offline: {friendly_name} - {entity_id}',
            }
            hass.services.call("notify", "persistent_notification", service_data, blocking=False)
            logger.warning(f'Offline {friendly_name} - { entity_id }')
            last_offline.add(entity_id)
            helper_update = True
        elif available and entity_id in last_offline:
            last_offline.remove(entity_id)
            helper_update = True
        logger.debug(f'{entity_id}, {hass.states.get(entity_id).attributes["friendly_name"]}, {available}, {last_seen}')
        return now, helper_update

    now = None
    if entity_ids:
        for entity_id in set(entity_ids):
            for device in devices:
                if entity_id in (x['entity_id'] for x in device['entities']):
                    now, helper_update = check_device(device, now, helper_update, entity_id)
                    break
            else:
                error('entity_id not found')
    else:
        for device in devices:
            if device['power_source'] == 'Mains':
                entities = device['entities']
                if not entities:
                    continue
                entities = (x['entity_id'] for x in device['entities'])
                entity_id = sorted(entities, key=lambda x: len(x))[0]
                now, helper_update = check_device(device, now, helper_update, entity_id)
except:
    error('exception', raise_exception = False)
    raise
finally:
    for offline in last_offline.copy():
        if not hass.states.get(offline):
            last_offline.discard(offline)
            helper_update = True
    if helper_update:
        new_offline = ' '.join(last_offline)
        if len(new_offline) > len_max:
            logger.warning('too long offline string, truncating')
            new_offline = new_offline[:len_max]
        hass.states.set(helper, new_offline, helper_state.attributes)
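
When no entity_ids are passed, the script picks one entity per Mains device, the one with the shortest entity_id, and uses it as the device’s name in the helper. Here is a minimal sketch of just that selection, runnable outside Home Assistant. The function name and the sample devices are hypothetical; the dictionary keys are the ones the script reads.

```python
def representative_entities(devices):
    """Return the shortest entity_id of every Mains device that has entities."""
    picked = []
    for device in devices:
        if device["power_source"] != "Mains":
            continue                      # battery devices are ignored
        ids = [e["entity_id"] for e in device["entities"]]
        if ids:                           # devices without entities are skipped
            picked.append(min(ids, key=len))
    return picked


# Hypothetical sample data, shaped like the fields the script uses.
devices = [
    {"power_source": "Mains",
     "entities": [{"entity_id": "sensor.plug_power"}, {"entity_id": "switch.plug"}]},
    {"power_source": "Battery or Unknown",
     "entities": [{"entity_id": "binary_sensor.door"}]},
    {"power_source": "Mains", "entities": []},
]
print(representative_entities(devices))
```

Note that the notification and the helper will therefore show whichever entity happens to have the shortest id, which is not necessarily the main light or switch entity of the device.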

create an input_text helper, preferably with its maximum length set to 255, the largest size available (why so short…). I called it input_text.last_seen_monitor
python_script.last_seen_monitor optionally takes an entity_ids: list of entity_ids to check, which overrides the power_source filter on Mains
you may want to decrease the amount of recording related to these events by tweaking the recorder: setting in configuration.yaml:

recorder:
  exclude:
    entities:
      - automation.last_seen_monitor
    event_types:
      - zha_toolkit_last_seen_monitor

all other dependencies and requirements are the same as for the ping alarm automation of the first post