One of my biggest fears with installations like HA is the occurrence of silent failures – where one or more sensors or automations fail without anyone noticing. For individual sensors, I solved the problem by doing the following two things:
- Creating automations that notify me if a sensor fails to send data for a prescribed amount of time. Here is a representative stanza:
Note: you can add as many sensors as you want and set a different time-before-warning for each one, in minutes (since I divide by 60).
id: "171134823020"
alias: No Recent Sensor Update
description: Sensor hasn't been updated recently
trigger:
- platform: template
id: Living Room Temperature [sensor.nexus_th_livingrm_temperature]
value_template:
"{{ (as_timestamp(now()) - as_timestamp(states.sensor.nexus_th_livingrm_temperature.last_updated))/60
> 120 }}"
- platform: template
id: Sam's Room Temperature [sensor.nexus_th_samrm_temperature]
value_template:
"{{ (as_timestamp(now()) - as_timestamp(states.sensor.nexus_th_samrm_temperature.last_updated))/60
> 60 }}"
- platform: template
id: John's Room Temperature [sensor.nexus_th_johnrm_temperature]
value_template:
"{{ (as_timestamp(now()) - as_timestamp(states.sensor.nexus_th_johnrm_temperature.last_updated))/60
> 240 }}"
...
...
action:
- service: shell_command.get_latest_states
data:
sensor: "{{ trigger.id.partition('[')[2].partition(']')[0] }}"
rows: 15
response_variable: statehist
- service: notify.my_email
metadata: {}
data:
title: "Stale Sensor: {{ trigger.id }}"
message: "{{ statehist['stdout'] }}"
- service: notify.persistent_notification
metadata: {}
data:
title: Sensor Offline/Unavailable/Unchanged
message: "{{ trigger.id }}"
- service: notify.mobile_app_pixel_7
metadata: {}
data:
title: Sensor Offline/Unavailable/Unchanged
message: "{{ trigger.id }}"
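As an aside, the partition calls in the action just pull the entity id out of the square brackets in trigger.id. A hypothetical shell equivalent (the same idea as the ${OUT%|*} expansions in the scripts further down):

# illustration only: extract the entity id between [ and ] from a trigger id
id='Living Room Temperature [sensor.nexus_th_livingrm_temperature]'
entity=${id#*[}     # drop everything up to and including the first '['
entity=${entity%]*} # drop the ']' and anything after it
echo "$entity"      # -> sensor.nexus_th_livingrm_temperature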
shell_command.get_latest_states itself is an optional script (you can cut that part out) that returns the last ‘rows’ entries from the states table for that sensor, so you can also see what has been going on historically:
shell_command:
  get_latest_states: >
    ssh -o StrictHostKeyChecking=no -i /config/.ssh/id_rsa <your_user_name>@localhost "
    sqlite3 /homeassistant/home-assistant_v2.db \"
    WITH ValidStates AS (SELECT state_id, states.metadata_id, state, last_updated_ts
    FROM states LEFT JOIN states_meta ON (states.metadata_id=states_meta.metadata_id)
    WHERE states_meta.entity_id = '{{ sensor }}' AND state NOT IN ('unknown', 'unavailable'))
    SELECT *, DATETIME(last_updated_ts, 'unixepoch', 'localtime'),
    CAST(ROUND((
    CASE
    WHEN LEAD(last_updated_ts, 1) OVER (ORDER BY state_id) IS NULL THEN strftime('%s','now')
    ELSE LEAD(last_updated_ts, 1) OVER (ORDER BY state_id) END
    - last_updated_ts)
    /60,0) AS INT)
    FROM ValidStates ORDER BY state_id DESC LIMIT {{ rows }}
    \""
which uses a little ssh-fu to get access to home-assistant_v2.db and read the states table – you need to place an RSA key in, say, /homeassistant/ and use your own ssh user name.
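In case it helps, a minimal sketch of the key setup (paths and the user name are placeholders; this assumes the standard ssh-keygen / authorized_keys workflow):

# generate a passphrase-less key pair where the shell_command can read it
ssh-keygen -t rsa -N "" -f /config/.ssh/id_rsa
# then append the public key to the authorized_keys of <your_user_name>
cat /config/.ssh/id_rsa.pub >> /home/<your_user_name>/.ssh/authorized_keys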
- As another check, I run the following commands once a day from my Linux server to find sensors that are either stale right now or that have had large update gaps over the past 24 hours – in this case, anything older than 5 hours (300 minutes). Note that the unquoted echo $SCRIPT at the end flattens the script to a single line before piping it to bash over ssh, which is why the multi-line layout works.
First, check for any currently stale sensors:
STALEMINS=300
SENSORS=(<list of metadata_id's to check>)
echo "#### Stale Sensors (>$STALEMINS mins) ####"
SCRIPT='for sensorID in '${SENSORS[@]}'; do
  OUT="$(sqlite3 /homeassistant/home-assistant_v2.db
    "SELECT CAST((strftime('"'%s','now'"')-last_updated_ts)/60 AS INT), states_meta.entity_id
    FROM states LEFT JOIN states_meta ON(states.metadata_id=states_meta.metadata_id)
    WHERE states.metadata_id=$sensorID AND states.state NOT IN('"'unknown'"', '"'unavailable'"')
    ORDER BY state_id DESC LIMIT 1")";
  MINUTES=${OUT%|*};
  SENSOR=${OUT#*|};
  [ -n "$MINUTES" ] && [ $MINUTES -gt '$STALEMINS' ] && echo -e "$MINUTES\t${sensorID}|$SENSOR";
done'
echo $SCRIPT | ssh homeassistant bash | sort -n
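To fill in the SENSORS array, one way to look up the metadata_id for each entity (assuming the same database path as above, run on the HA box or via the same ssh trick; adjust the LIKE pattern to your own entities):

sqlite3 /homeassistant/home-assistant_v2.db \
  "SELECT metadata_id, entity_id FROM states_meta WHERE entity_id LIKE 'sensor.nexus_th_%';"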
Then check for any gaps >$STALEMINS minutes over the past 24 hours:
echo -e "#### Max time between updates over past 24 hours (>$STALEMINS mins) ####"
SCRIPT='for sensorID in '${SENSORS[@]}'; do
  OUT="$(sqlite3 /homeassistant/home-assistant_v2.db
    "WITH ValidStates AS (SELECT last_updated_ts, state, metadata_id
    FROM states
    WHERE metadata_id = $sensorID AND state NOT IN('"'unknown'"', '"'unavailable'"')
    ORDER BY last_updated_ts DESC LIMIT 2)
    SELECT CAST((last_updated_ts - LAG(last_updated_ts, 1)
    OVER(ORDER BY last_updated_ts))/60 AS INT), entity_id
    FROM ValidStates LEFT JOIN states_meta ON(ValidStates.metadata_id=states_meta.metadata_id)
    ORDER BY last_updated_ts DESC LIMIT 1")";
  MINUTES=${OUT%|*};
  SENSOR=${OUT#*|};
  [ -n "$MINUTES" ] && [ $MINUTES -gt '$STALEMINS' ] && echo -e "$MINUTES\t${sensorID}|$SENSOR";
done;'
echo $SCRIPT | ssh homeassistant bash | sort -n
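Both checks can go into a single script and be scheduled with cron; a hypothetical crontab entry on the Linux server (the script name and path are placeholders – cron mails any stdout to MAILTO by default):

MAILTO=you@example.com
0 7 * * * bash /home/you/bin/check_ha_sensors.sh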
But what if:
- homeassistant itself crashes
- Or you “forget” to restart it after shutting it down for maintenance
- Or, as happened to me last night, the database gets corrupted, shrinks to essentially zero, and stops recording data – even though HA is seemingly still running
So I created the following crontab one-liner, which runs once an hour and does the following:
- Check that homeassistant is still running
- Check that home-assistant_v2.db is larger than X MB (100 MB for me)
- Copy the key data (the home-assistant_v2.db file and the .storage directory) to rotating (local) backup directories – I put them in /backup/hourlies
If any one of these fails, cron sends me an email. Note that escaping is a little weird for crontab in that ‘%’ also needs escaping.
00 * * * * bash -l -c 'PS1=1; DB=home-assistant_v2.db; DIR=/homeassistant; BACKDIR=/backup/hourlies; ssh homeassistant "source /etc/profile.d/homeassistant.sh; { STATUS=\$(ha core status 2>/dev/null) && SIZE=\"\$(stat -c\%s $DIR/$DB 2>/dev/null)\" && [ \"\$SIZE\" -gt \$((100*1024*1024)) ] && sudo cp -a $DIR/$DB $BACKDIR/$DB-\$(date +'\%H') && sudo cp -a $DIR/.storage/. $BACKDIR/.storage-\$(date +'\%H'); } || echo -e \"HomeAssistant: \$STATUS\n\n$DB: \$SIZE bytes\""'
where ‘/etc/profile.d/homeassistant.sh’ contains the SUPERVISOR_TOKEN needed for root access.
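That file can be as simple as the following sketch (the token value is elided; this assumes the ha CLI picks the token up from the environment):

# /etc/profile.d/homeassistant.sh
export SUPERVISOR_TOKEN="<your_supervisor_token>"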
So now, in summary, I get notified if:
- One or more individual sensors are not reporting in a timely manner
- HA or its database is not working properly
Regardless, I get rotating hourly backups, so even if things go wrong I shouldn't lose too many minutes of data.
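And if things do go wrong, restoring from the newest hourly snapshot is just a couple of copies; a hypothetical sketch (here assuming the 14:00 copies are the newest – stop HA first so the database isn't being written while you copy):

ha core stop
sudo cp -a /backup/hourlies/home-assistant_v2.db-14 /homeassistant/home-assistant_v2.db
sudo cp -a /backup/hourlies/.storage-14/. /homeassistant/.storage/
ha core start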