One of my biggest fears with installations like HA is the occurrence of silent failures – where one or more sensors or automations fail without anyone noticing. For individual sensors, I solved the problem by doing the following two things:
- Creating automations that notify me if a sensor fails to send data for a prescribed amount of time. Here is a representative stanza:
Note: you can add as many sensors as you want and set a different time-before-warning for each one, in minutes (since I divide by 60).
id: "171134823020"
alias: No Recent Sensor Update
description: Sensor hasn't been updated recently
trigger:
- platform: template
id: Living Room Temperature [sensor.nexus_th_livingrm_temperature]
value_template:
"{{ (as_timestamp(now()) - as_timestamp(states.sensor.nexus_th_livingrm_temperature.last_updated))/60
> 120 }}"
- platform: template
id: Sam's Room Temperature [sensor.nexus_th_samrm_temperature]
value_template:
"{{ (as_timestamp(now()) - as_timestamp(states.sensor.nexus_th_samrm_temperature.last_updated))/60
> 60 }}"
- platform: template
id: John's Room Temperature [sensor.nexus_th_johnrm_temperature]
value_template:
"{{ (as_timestamp(now()) - as_timestamp(states.sensor.nexus_th_johnrm_temperature.last_updated))/60
> 240 }}"
...
...
action:
- service: shell_command.get_latest_states
data:
sensor: "{{ trigger.id.partition('[')[2].partition(']')[0] }}"
rows: 15
response_variable: statehist
- service: notify.my_email
metadata: {}
data:
title: "Stale Sensor: {{ trigger.id }}"
message: "{{ statehist['stdout'] }}"
- service: notify.persistent_notification
metadata: {}
data:
title: Sensor Offline/Unavailable/Unchanged
message: "{{ trigger.id }}"
- service: notify.mobile_app_pixel_7
metadata: {}
data:
title: Sensor Offline/Unavailable/Unchanged
message: "{{ trigger.id }}"
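As an aside, the partition calls in the action just pull the entity id out of the square brackets in trigger.id. A hypothetical shell equivalent (the same idea as the ${OUT%|*} expansions in the scripts further down):

# illustration only: extract the entity id between [ and ] from a trigger id
id='Living Room Temperature [sensor.nexus_th_livingrm_temperature]'
entity=${id#*[}     # drop everything up to and including the first '['
entity=${entity%]*} # drop the ']' and anything after it
echo "$entity"      # -> sensor.nexus_th_livingrm_temperature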
shell_command.get_latest_states itself is an optional script (you can cut that part out) that returns the last ‘rows’ entries from the states table for that sensor, so you can also see what has been going on historically:
shell_command:
  get_latest_states: >
    ssh -o StrictHostKeyChecking=no -i /config/.ssh/id_rsa <your_user_name>@localhost "
    sqlite3 /homeassistant/home-assistant_v2.db \"
    WITH ValidStates AS (SELECT state_id, states.metadata_id, state, last_updated_ts
    FROM states LEFT JOIN states_meta ON (states.metadata_id=states_meta.metadata_id)
    WHERE states_meta.entity_id = '{{ sensor }}' AND state NOT IN ('unknown', 'unavailable'))
    SELECT *, DATETIME(last_updated_ts, 'unixepoch', 'localtime'),
    CAST(ROUND((
    CASE
    WHEN LEAD(last_updated_ts, 1) OVER (ORDER BY state_id) IS NULL THEN strftime('%s','now')
    ELSE LEAD(last_updated_ts, 1) OVER (ORDER BY state_id) END
    - last_updated_ts)
    /60,0) AS INT)
    FROM ValidStates ORDER BY state_id DESC LIMIT {{ rows }}
    \""
which uses a little ssh-fu to get access to home-assistant_v2.db and read the states table – you need to place an RSA key in, say, /homeassistant/ and use your own ssh user name.
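In case it helps, a minimal sketch of the key setup (paths and the user name are placeholders; this assumes the standard ssh-keygen / authorized_keys workflow):

# generate a passphrase-less key pair where the shell_command can read it
ssh-keygen -t rsa -N "" -f /config/.ssh/id_rsa
# then append the public key to the authorized_keys of <your_user_name>
cat /config/.ssh/id_rsa.pub >> /home/<your_user_name>/.ssh/authorized_keys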
- As another check, I run the following commands once a day from my Linux server to find sensors that are either stale right now or that have had large update gaps over the past 24 hours – in this case, anything older than 5 hours (300 minutes). Note that the unquoted echo $SCRIPT at the end flattens the script to a single line before piping it to bash over ssh, which is why the multi-line layout works.
First, check for any currently stale sensors:
STALEMINS=300
SENSORS=(<list of metadata_id's to check>)
echo "#### Stale Sensors (>$STALEMINS mins) ####"
SCRIPT='for sensorID in '${SENSORS[@]}'; do
  OUT="$(sqlite3 /homeassistant/home-assistant_v2.db
    "SELECT CAST((strftime('"'%s','now'"')-last_updated_ts)/60 AS INT), states_meta.entity_id
    FROM states LEFT JOIN states_meta ON(states.metadata_id=states_meta.metadata_id)
    WHERE states.metadata_id=$sensorID AND states.state NOT IN('"'unknown'"', '"'unavailable'"')
    ORDER BY state_id DESC LIMIT 1")";
  MINUTES=${OUT%|*};
  SENSOR=${OUT#*|};
  [ -n "$MINUTES" ] && [ $MINUTES -gt '$STALEMINS' ] && echo -e "$MINUTES\t${sensorID}|$SENSOR";
done'
echo $SCRIPT | ssh homeassistant bash | sort -n
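To fill in the SENSORS array, one way to look up the metadata_id for each entity (assuming the same database path as above, run on the HA box or via the same ssh trick; adjust the LIKE pattern to your own entities):

sqlite3 /homeassistant/home-assistant_v2.db \
  "SELECT metadata_id, entity_id FROM states_meta WHERE entity_id LIKE 'sensor.nexus_th_%';"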
Then check for any gaps >$STALEMINS minutes over the past 24 hours:
echo -e "#### Max time between updates over past 24 hours (>$STALEMINS mins) ####"
SCRIPT='for sensorID in '${SENSORS[@]}'; do
  OUT="$(sqlite3 /homeassistant/home-assistant_v2.db
    "WITH ValidStates AS (SELECT last_updated_ts, state, metadata_id
    FROM states
    WHERE metadata_id = $sensorID AND state NOT IN('"'unknown'"', '"'unavailable'"')
    ORDER BY last_updated_ts DESC LIMIT 2)
    SELECT CAST((last_updated_ts - LAG(last_updated_ts, 1)
    OVER(ORDER BY last_updated_ts))/60 AS INT), entity_id
    FROM ValidStates LEFT JOIN states_meta ON(ValidStates.metadata_id=states_meta.metadata_id)
    ORDER BY last_updated_ts DESC LIMIT 1")";
  MINUTES=${OUT%|*};
  SENSOR=${OUT#*|};
  [ -n "$MINUTES" ] && [ $MINUTES -gt '$STALEMINS' ] && echo -e "$MINUTES\t${sensorID}|$SENSOR";
done;'
echo $SCRIPT | ssh homeassistant bash | sort -n
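Both checks can go into a single script and be scheduled with cron; a hypothetical crontab entry on the Linux server (the script name and path are placeholders – cron mails any stdout to MAILTO by default):

MAILTO=you@example.com
0 7 * * * bash /home/you/bin/check_ha_sensors.sh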
But what if:
- homeassistant itself crashes
- Or you “forget” to restart it after shutting it down for maintenance
- Or, as happened to me last night, the database gets corrupted, shrinks to essentially zero, and stops recording data – even though HA is seemingly still running
So I created the following crontab one-liner, which runs once an hour and does the following:
- Check that homeassistant is still running
- Check that home-assistant_v2.db is larger than X MB (100 MB for me)
- Copy the key data (the home-assistant_v2.db file and the .storage directory) to rotating (local) backup directories – I put them in /backup/hourlies
If any one of these fails, cron sends me an email. Note that escaping is a little weird for crontab in that ‘%’ also needs escaping.
00 * * * * bash -l -c 'PS1=1; DB=home-assistant_v2.db; DIR=/homeassistant; BACKDIR=/backup/hourlies; ssh homeassistant "source /etc/profile.d/homeassistant.sh; { STATUS=\$(ha core status 2>/dev/null) && SIZE=\"\$(stat -c\%s $DIR/$DB 2>/dev/null)\" && [ \"\$SIZE\" -gt \$((100*1024*1024)) ] && sudo cp -a $DIR/$DB $BACKDIR/$DB-\$(date +'\%H') && sudo cp -a $DIR/.storage/. $BACKDIR/.storage-\$(date +'\%H'); } || echo -e \"HomeAssistant: \$STATUS\n\n$DB: \$SIZE bytes\""'
where ‘/etc/profile.d/homeassistant.sh’ contains the SUPERVISOR_TOKEN needed for root access.
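That file can be as simple as the following sketch (the token value is elided; this assumes the ha CLI picks the token up from the environment):

# /etc/profile.d/homeassistant.sh
export SUPERVISOR_TOKEN="<your_supervisor_token>"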
So now, in summary, I get notified if:
- One or more individual sensors are not reporting in a timely manner
- HA or its database is not working properly
Regardless, I get rotating hourly backups, so even if things go wrong I shouldn't lose too many minutes of data.
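And if things do go wrong, restoring from the newest hourly snapshot is just a couple of copies; a hypothetical sketch (here assuming the 14:00 copies are the newest – stop HA first so the database isn't being written while you copy):

ha core stop
sudo cp -a /backup/hourlies/home-assistant_v2.db-14 /homeassistant/home-assistant_v2.db
sudo cp -a /backup/hourlies/.storage-14/. /homeassistant/.storage/
ha core start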