Z-Wave Network Health - Template Sensor

I have a template sensor that is meant to give me the status of my Z-Wave network. Its possible values are critical, severe, warning, minor, and ok. (I have several monitor-type sensors that use these values.)

Z-Wave status sensor
- sensor:
    - name: "ZWave Status"
      unique_id: zwave_status
      icon: mdi:z-wave
      state: >
        {% if is_state('binary_sensor.zwave_network', 'off') %} critical
        {% else %} {{ iif(states('sensor.offline_zwave_devices') | int(-1) > 0, 'warning', 'ok') }}
        {% endif %}

The state of critical is reserved for the entire z-wave network being unavailable and is determined by this template sensor.

Z-Wave Network Sensor
- trigger:
    - platform: homeassistant
      event: start

    - platform: event
      event_type: event_template_reloaded

    - platform: state
      entity_id: sensor.time
  binary_sensor:
    - name: "ZWave Network"
      unique_id: zwave_network
      icon: mdi:z-wave
      device_class: connectivity
      state: >
        {{ (is_state('binary_sensor.z_wave_js_running', 'on')
              or is_state('binary_sensor.z_wave_js_to_mqtt_running', 'on'))
            and is_state('sensor.zwave_controller_status', 'ready')
            and integration_entities('zwave_js') | select('has_value') | list | count > 0 }}

The state of warning is used if there are one or more offline z-wave devices.

Offline Z-Wave devices sensor
- sensor:
    - name: "Offline ZWave Devices"
      unique_id: offline_zwave_devices
      icon: mdi:z-wave
      unit_of_measurement: devices
      state: >
        {% set entities = state_attr(this.entity_id, 'entity_id') %}
        {{ -1 if entities == none else entities | count }}
      attributes:
        entity_id: >
          {{ expand(integration_entities('zwave_js'))
              | selectattr('entity_id', 'contains', 'node_status')
              | selectattr('state', 'in', ['dead', 'unavailable', 'unknown'])
              | map(attribute="object_id")
              | map('regex_replace', find='(.*)_node_status', replace='button.\\1_ping', ignorecase=False)
              | list | sort }}

What I would like to do is use the sensors provided for the Z-Wave hub to set values for the status sensor that make sense for the states of severe, warning, and minor. I have a basic understanding of what most of these sensors are, but I really don’t know enough about what their values should be to classify them properly.

Z-Wave Controller sensors

sensor.z_stick_gen5_usb_controller_messages_dropped_tx
sensor.z_stick_gen5_usb_controller_messages_dropped_rx
sensor.z_stick_gen5_usb_controller_messages_not_accepted
sensor.z_stick_gen5_usb_controller_collisions
sensor.z_stick_gen5_usb_controller_missing_acks
sensor.z_stick_gen5_usb_controller_timed_out_responses
sensor.z_stick_gen5_usb_controller_timed_out_callbacks
sensor.z_stick_gen5_usb_controller_average_background_rssi_channel_0
sensor.z_stick_gen5_usb_controller_current_background_rssi_channel_0

Is there anyone who would be willing to have a quick look and classify these sensors like this for me?

missing_acks: severe > 25, warning > 15, minor > 5

While node status is helpful, it is insufficient to determine the health of a node. Nodes are marked dead when a command fails. So if you aren’t sending commands, or it is a passive device like a temperature sensor, it won’t be marked dead.

For my battery-powered devices I set the wake-up interval to 14400 seconds, and if the node status never goes to “awake” within 28800 seconds I generate an alert.
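The wake-up check could be sketched roughly like this. This is not the poster's actual config: the `front_door_sensor` entity names are hypothetical, and only the 28800-second limit comes from the post. A trigger-based sensor records the last time the node went to "awake", and a binary sensor compares that timestamp with the current time.

```yaml
template:
  # Record the time of the most recent transition to "awake".
  - trigger:
      - platform: state
        entity_id: sensor.front_door_sensor_node_status  # hypothetical entity
        to: "awake"
    sensor:
      - name: "front_door_sensor_last_awake"  # hypothetical name
        device_class: timestamp
        state: "{{ now().isoformat() }}"

  # On while the device has woken up within the last 28800 seconds.
  # Templates that use now() are re-evaluated about once per minute.
  - binary_sensor:
      - name: "front_door_sensor_online"  # hypothetical name
        device_class: connectivity
        state: >
          {% set last = as_timestamp(states('sensor.front_door_sensor_last_awake'), 0) %}
          {{ last > 0 and (as_timestamp(now()) - last) < 28800 }}
```

Until the device wakes up for the first time after this is added, the binary sensor stays off, so an alert built on it would want some grace period.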

For devices that periodically send data, for example a temperature sensor that sends the temperature every 10 minutes, I generate an alert if I don’t receive a js update for temperature within 30 minutes.

For switches and fan controls, I periodically poll them (every 30 minutes), and if I don’t get a js update within 90 minutes I generate an alert. Typically this also forces them to a dead state if the poll command can’t be delivered.

Here is an example for a temperature sensor. Looks like if I don’t hear from it within 480 seconds it’s considered dead.
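As a rough sketch of the polling part, an automation along these lines could refresh a switch every 30 minutes using the `zwave_js.refresh_value` service. The alias and `switch.hallway_switch` are hypothetical names chosen for illustration, not taken from the poster's setup.

```yaml
automation:
  - alias: "Poll hallway switch"  # hypothetical name
    trigger:
      # Fires at minute 0 and minute 30 of every hour.
      - platform: time_pattern
        minutes: "/30"
    action:
      # Ask Z-Wave JS to re-read the value from the node. If the command
      # cannot be delivered, the node is typically marked dead.
      - service: zwave_js.refresh_value
        data:
          entity_id: switch.hallway_switch  # hypothetical entity
```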

input_boolean:
  al_zw_basement_sensor_alert_enable:
    name: zw_basement_sensor alert enable
binary_sensor:
  - platform: template
    sensors:
      al_zw_basement_sensor_alert:
        value_template: '{{ (is_state("binary_sensor.zw_basement_sensor_online", "off")) and is_state("input_boolean.al_zw_basement_sensor_alert_enable", "on") }}'
  - platform: template
    sensors:
      zw_basement_sensor_online:
        value_template: >-
          {{ ( 480 - (states('sensor.zw_basement_sensor_latency') | int(0))) > 0  and

               states('sensor.basement_sensor_node_status') != 'dead' and
               states('sensor.basement_sensor_node_status') != 'unavailable' and
               states('sensor.basement_sensor_node_status') != 'unknown'
          }}
alert:
  al_zw_basement_sensor:
    name: zw_basement_sensor
    message: 'ALERT {{state_attr("zone.home","friendly_name")}} zw_basement_sensor Latency: {{ states("sensor.zw_basement_sensor_latency")}} Node {{ states("sensor.basement_sensor_node_status")}}'
    done_message: 'Cleared {{state_attr("zone.home","friendly_name")}} zw_basement_sensor Latency: {{ states("sensor.zw_basement_sensor_latency")}} Node {{ states("sensor.basement_sensor_node_status")}}'
    entity_id: binary_sensor.al_zw_basement_sensor_alert
    state: "on"
    repeat:
      - 1
      - 240
    can_acknowledge: true
    skip_first: true
    notifiers:
      - sms_notifiers_all
      - sms_telegram_admin
recorder:
  include:
    entities:
      - binary_sensor.al_zw_basement_sensor_alert
      - alert.al_zw_basement_sensor
      - input_boolean.al_zw_basement_sensor_alert_enable
      - binary_sensor.zw_basement_sensor_online
      - sensor.basement_sensor_node_status
template:
  - trigger:
      - platform: homeassistant
        event: start
      - platform: zwave_js.value_updated
        entity_id:
          - sensor.basement_temperature
        command_class: 49
        property: Air temperature
    sensor:
      - name: "zw_basement_sensor_last_updated"
        state: "{{ now() }} "
sensor:
  - platform: template
    sensors:
      zw_basement_sensor_latency:
        unit_of_measurement: secs
        value_template: >-
          {%- if as_timestamp(states("sensor.zw_basement_sensor_last_updated"),0) == 0 %} 0 {%- else %} {{ ((as_timestamp(now(), 0) | int(0)) - as_timestamp(states('sensor.zw_basement_sensor_last_updated'),0) | int(0)) }} {%- endif %}

Interesting approach for an individual node sensor, thanks for throwing it out there! The node sensor has been adequate for me. Since I automated pinging dead nodes, haven’t really had an issue with devices staying offline.

What I’m more looking for here is an overall zwave network health sensor, hence providing the sensors for the zwave controller rather than a device. To be honest I’m not even sure this is a good approach but it’s better than nothing?

Anyway, Cunningham’s Law and all that… here’s what I have for the sensor. The numbers I’ve used are obviously entirely made up. I was hoping that someone with a more in depth knowledge of zwave networks would be able to help me with some sensible values.

- sensor:
    - name: "ZWave Status"
      unique_id: zwave_status
      icon: mdi:z-wave
      state: >
        {% set offline = states('sensor.offline_zwave_devices') | int(-1) > 0 %}
        {% set dropped_tx = states('sensor.z_stick_gen5_usb_controller_messages_dropped_tx') | int(-1) %}
        {% set dropped_rx = states('sensor.z_stick_gen5_usb_controller_messages_dropped_rx') | int(-1) %}
        {% set msg_na = states('sensor.z_stick_gen5_usb_controller_messages_not_accepted') | int(-1) %}
        {% set collisions = states('sensor.z_stick_gen5_usb_controller_collisions') | int(-1) %}
        {% set mssing_acks = states('sensor.z_stick_gen5_usb_controller_missing_acks') | int(-1) %}
        {% set to_response = states('sensor.z_stick_gen5_usb_controller_timed_out_responses') | int(-1) %}
        {% set to_callback = states('sensor.z_stick_gen5_usb_controller_timed_out_callbacks') | int(-1) %}
        {% set avg_rssi = states('sensor.z_stick_gen5_usb_controller_average_background_rssi_channel_0') | int(-1) %}
        {% set current_rssi = states('sensor.z_stick_gen5_usb_controller_current_background_rssi_channel_0') | int(-1) %}

        {% if is_state('binary_sensor.zwave_network', 'off') %} critical
        {% elif dropped_tx > 50
            or dropped_rx > 50
            or msg_na > 50
            or collisions > 50
            or mssing_acks > 50
            or to_response > 50
            or to_callback > 50
            or avg_rssi > 50
            or current_rssi > 50 %} severe
        {% elif offline
            or dropped_tx > 30
            or dropped_rx > 30
            or msg_na > 30
            or collisions > 30
            or mssing_acks > 30
            or to_response > 30
            or to_callback > 30
            or avg_rssi > 30
            or current_rssi > 10 %} warning
        {% elif offline
            or dropped_tx > 10
            or dropped_rx > 10
            or msg_na > 10
            or collisions > 10
            or mssing_acks > 10
            or to_response > 10
            or to_callback > 10
            or avg_rssi > 10
            or current_rssi > 10 %} minor
        {% else %} ok
        {% endif %}

A challenge is that all those counters count up indefinitely and only get reset when zwavejs is restarted.

The last elif will never execute as “offline or” is present in the prior condition.

Here are the stats from one of my systems that has been running for a month or so. There are some error counts, and although they are at a low level, they are close to your thresholds. Looking at the rate of increase would be more helpful: 8 timeouts per month is different from 8 timeouts in the last minute. The former is expected behavior and the latter is an emerging high-priority issue. Some level of errors is expected. When I run the microwave, it interferes with some devices on the other side of the wall and causes them to find a different route, but the mesh adapts and all is fine.
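One way to look at the rate rather than the running total, as a sketch only: Home Assistant's `statistics` sensor platform can report the change of a source sensor over a sliding window. The sensor name and the one-hour window below are assumptions picked for illustration, not tested values.

```yaml
sensor:
  - platform: statistics
    name: "ZWave Timed Out Responses Last Hour"  # hypothetical name
    entity_id: sensor.z_stick_gen5_usb_controller_timed_out_responses
    # Difference between the newest and oldest sample in the window.
    state_characteristic: change
    max_age:
      hours: 1
    sampling_size: 200
```

The status template could then compare `states('sensor.zwave_timed_out_responses_last_hour') | int(0)` against a threshold instead of the raw counter. Two caveats: the change goes negative when the counter resets on a zwavejs restart, and the sensor can become unknown when the counter has not moved within the window, so both cases would need to be treated as zero.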

Aha. This is the info I was looking for! I wasn’t sure how often the sensors reset. But yeah, looks like I’m probably barking up the wrong tree here unless I run them all through utility meters or something. That’s likely too much trouble for the expected benefit at the end of the day, especially since I’m not having issues. It was more of a “because I can” thing.

D’oh! ’Tis a silly copy-paste error, but it would not have affected the result in this case. If offline is true, evaluation stops at the warning level; if it is false, evaluation still continues to the minor level.
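For what it’s worth, the tiers could also be collapsed so the copy-paste problem can’t creep in. This is only a sketch using the same made-up thresholds from earlier in the thread: it takes the largest of the error counters and checks that once per level. The two RSSI sensors are left out here on the assumption that they report a signal level rather than an error count.

```yaml
- sensor:
    - name: "ZWave Status"
      icon: mdi:z-wave
      state: >
        {% set prefix = 'sensor.z_stick_gen5_usb_controller_' %}
        {% set names = ['messages_dropped_tx', 'messages_dropped_rx',
                        'messages_not_accepted', 'collisions', 'missing_acks',
                        'timed_out_responses', 'timed_out_callbacks'] %}
        {# Collect every counter as an int; unavailable sensors become -1 #}
        {% set ns = namespace(values=[]) %}
        {% for name in names %}
          {% set ns.values = ns.values + [states(prefix ~ name) | int(-1)] %}
        {% endfor %}
        {% set worst = ns.values | max %}
        {% set offline = states('sensor.offline_zwave_devices') | int(-1) > 0 %}

        {% if is_state('binary_sensor.zwave_network', 'off') %} critical
        {% elif worst > 50 %} severe
        {% elif offline or worst > 30 %} warning
        {% elif worst > 10 %} minor
        {% else %} ok
        {% endif %}
```

This is meant as a replacement for the earlier status sensor rather than an addition, so it would take over that sensor’s `unique_id`.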

Anyway, I will say that Mr. Cunningham was spot on with his theory :wink:

Thanks for looking at it.