Modifying data in the database via script?

swetzel · December 19, 2017, 2:55am

Thanks @VDRainer. I made a Python script to do what I wanted, go through my temperature data and delete extreme outliers. I’ll post the code here as a reference in case anyone wants to do something similar. If you want to run this script as is, put it in .homeassistant/scripts/scrub_db.py. Change the scrub_table() calls at the bottom to the entity ids you want to scrub. It should ask before deleting any rows. Be careful though, I barely tested this, no promises it will work and won’t delete all your data.

For now I’m manually running as these errors are pretty rare. It would be easy enough to make it run without interaction and set it to run with cron. But I want to test it more before I do that.

from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from sqlalchemy import create_engine

engine = create_engine('sqlite:///../home-assistant_v2.db', echo=False)
Base = automap_base()
Base.prepare(engine, reflect=True)

State = Base.classes.states
Event = Base.classes.events

def scrub_table(session, entity_id, offset = 10, min_rows = 100, padding = 5):
    res = session.query(State).filter_by(entity_id = entity_id)
    print("Entity {} has {} rows".format(entity_id, res.count()))
    if res.count() < min_rows:
        print("Not enough rows")
        return
    # The idea here is if you have thousands of data points and only one or two are off by a lot
    # An example of temperature data points: (30, 32, 62, 63, 64, ... 65, 66), where 30 and 32 are errors, and we want to delete them
    # We take the 10th lowest point, subtract 5, and call that the cut off.  For my data that works well, but you will want to think about your data.
    # There are better ways to do this with standard deviations or deltas between points, but this was simple and works well for me.
    low_cut_off = float(res.order_by(State.state)[offset].state) - padding
    high_cut_off = float(res.order_by(State.state)[-1 - offset].state) + padding
    print("Low cut off {}, high cut off {}".format(low_cut_off, high_cut_off))
    delete_count = 0
    for row in res.filter(State.state > high_cut_off):
        print("Too high: {}".format(row.state))
        delete_count += 1
    for row in res.filter(State.state < low_cut_off):
        print("Too low: {}".format(row.state))
        delete_count += 1
    if delete_count == 0: return
    ans = input("Do you want to delete these {} rows [y/N]? ".format(delete_count))
    if ans == 'y':
        res.filter(State.state > high_cut_off).delete(synchronize_session=False)
        res.filter(State.state < low_cut_off).delete(synchronize_session=False)
        session.commit()


if __name__ == "__main__":
    session = Session(engine)
    scrub_table(session, 'sensor.bedroom_temperature')
    scrub_table(session, 'sensor.bedroom_humidity')
    scrub_table(session, 'sensor.living_room_temperature')
    scrub_table(session, 'sensor.living_room_humidity')
    scrub_table(session, 'sensor.dht_sensor_humidity')
    scrub_table(session, 'sensor.dht_sensor_temperature')

Example running:

Entity sensor.bedroom_temperature has 2691 rows
Low cut off 56.3, high cut off 70.1
Too low: 39.4
Do you want to delete these 1 rows [y/N]? y
Entity sensor.bedroom_humidity has 2895 rows
Low cut off 24.0, high cut off 42.5
Entity sensor.living_room_temperature has 540 rows
Low cut off 55.8, high cut off 72.3
Entity sensor.living_room_humidity has 1286 rows
Low cut off 23.6, high cut off 39.5
Entity sensor.dht_sensor_humidity has 10297 rows
Low cut off 20.5, high cut off 41.5
Entity sensor.dht_sensor_temperature has 10078 rows
Low cut off 42.7, high cut off 72.3