Hey all,
I’ve been looking into the recorder for a while, and recently migrated to PostgreSQL/TimescaleDB. Learning more about timeseries, I noticed that the recorder (using PostgreSQL; not sure if SQLite would be any different) currently stores interval-based data, i.e. it writes the same value/data multiple times. For example:
homeassistant=# select * from statistics_short_term where metadata_id=22 limit 10;
id | created | created_ts | metadata_id | start | start_ts | mean | min | max | last_reset | last_reset_ts | state | sum
-------+---------+--------------------+-------------+-------+------------+-------+-------+-------+------------+---------------+-------+-----
31563 | | 1717294810.4162643 | 22 | | 1717294500 | 83.36 | 83.36 | 83.36 | | | |
31570 | | 1717295110.4101014 | 22 | | 1717294800 | 83.36 | 83.36 | 83.36 | | | |
31577 | | 1717295410.4212794 | 22 | | 1717295100 | 83.36 | 83.36 | 83.36 | | | |
31584 | | 1717295710.4160612 | 22 | | 1717295400 | 83.36 | 83.36 | 83.36 | | | |
31591 | | 1717296010.4175048 | 22 | | 1717295700 | 83.36 | 83.36 | 83.36 | | | |
31598 | | 1717296310.416331 | 22 | | 1717296000 | 83.36 | 83.36 | 83.36 | | | |
31605 | | 1717296610.4163961 | 22 | | 1717296300 | 83.36 | 83.36 | 83.36 | | | |
31612 | | 1717296910.416549 | 22 | | 1717296600 | 83.36 | 83.36 | 83.36 | | | |
31619 | | 1717297210.4168398 | 22 | | 1717296900 | 83.36 | 83.36 | 83.36 | | | |
31626 | | 1717297510.4161253 | 22 | | 1717297200 | 83.36 | 83.36 | 83.36 | | | |
First thing I noticed here is that we have the fields created and start, which are of type timestamp, yet they are never set. Instead we have floats (created_ts and start_ts, where last_reset_ts presumably is one as well) representing the Unix timestamp in seconds with sub-second precision.
So first, why even have these timestamp columns at all? Why not use the columns actually intended for this purpose?
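For what it’s worth, the float epoch columns can at least be read back as proper timestamps at query time; a minimal example using standard PostgreSQL (nothing recorder-specific assumed here):

-- to_timestamp() turns a float Unix epoch into a timestamptz
SELECT id, to_timestamp(start_ts) AS start, mean
FROM statistics_short_term
WHERE metadata_id = 22
LIMIT 5;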
Secondly, I’m not sure why we are storing two timestamps per row; I guess created_ts is when the row was written and start_ts is the start of the interval it covers? From a database storage efficiency point of view, this seems a bit wasteful. Granted, this table is intended for short-term storage (10 days by default), but we could still do better. Why aren’t we? Something to discuss?
Thirdly, the data is itself a timeseries. What I’ve learned about timeseries data is that you’d ideally only store/commit a row if/when the data actually changed, e.g. (from AI :p):
-- only insert the new row if an identical row doesn't already exist
INSERT INTO mytable (col1, col2)
SELECT 'data1', 'data2'
WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE col1 = 'data1' AND col2 = 'data2');
So something along those lines should be possible.
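Applied to the table above, a purely illustrative (and untested) variant could skip the insert when the newest row for the same sensor already carries the same values:

-- hypothetical sketch: only insert when the latest row for this metadata_id
-- differs; the literal values stand in for the next computed statistics row
INSERT INTO statistics_short_term (created_ts, metadata_id, start_ts, mean, min, max)
SELECT 1717297810.4, 22, 1717297500, 83.36, 83.36, 83.36
WHERE NOT EXISTS (
    SELECT 1 FROM statistics_short_term
    WHERE metadata_id = 22
      AND start_ts = (SELECT max(start_ts) FROM statistics_short_term WHERE metadata_id = 22)
      AND mean = 83.36 AND min = 83.36 AND max = 83.36
);

One caveat: whatever reads this back would then have to carry the last known value forward (gap filling), since a missing interval would now mean ‘unchanged’ rather than ‘no data’.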
Fourthly, and this is just my lack of understanding: what is the meaning of mean/min/max? If I get a piece of sensor data, it has a single value, so when/how would the min/max/mean values come into play? Could we not do this ‘smarter’? For many sensors we’d only ever get a single value per interval, no? Just to enlighten myself, which sensors produce multiple values?
simple example:
3373 | | 1717383610.2993858 | 22 | | 1717380000 | 83.36 | 83.36 | 83.36 | | | |
3386 | | 1717387210.297326 | 22 | | 1717383600 | 83.36 | 83.36 | 83.36 | | | |
3399 | | 1717390810.3020093 | 22 | | 1717387200 | 83.36 | 83.36 | 83.36 | | | |
3412 | | 1717394410.302218 | 22 | | 1717390800 | 83.36 | 83.36 | 83.36 | | | |
3426 | | 1717398010.2976599 | 22 | | 1717394400 | 83.36 | 83.36 | 83.36 | | | |
3440 | | 1717401610.3043294 | 22 | | 1717398000 | 83.36 | 83.36 | 83.36 | | | |
3454 | | 1717405210.3032625 | 22 | | 1717401600 | 83.41644472104998 | 83.36 | 83.54 | | | |
3468 | | 1717408810.2744696 | 22 | | 1717405200 | 83.53999999999999 | 83.54 | 83.54 | | | |
3482 | | 1717412410.3030684 | 22 | | 1717408800 | 83.53999999999999 | 83.54 | 83.54 | | | |
3498 | | 1717416014.1221318 | 22 | | 1717412400 | 83.53999999999999 | 83.54 | 83.54 | | | |
3530 | | 1717419610.4555848 | 22 | | 1717416000 | 83.53999999999999 | 83.54 | 83.54 | | | |
3555 | | 1717423210.4580786 | 22 | | 1717419600 | 83.53999999999999 | 83.54 | 83.54 | | | |
(ignore the fact that the data is not properly rounded, I think that’s an HA bug)
Or is it the case that the recorder is doing something smart at the time of transition (row 3454), where it stores the previous value (in this case as the min, but it could just as well be the max) and the new value (in this case the max)? Is there a (huge?) performance benefit in doing so? Could this not be done in the SQL query itself that retrieves the data, with a view/index to accelerate it? I’m not opposed to this, it just seems like we’re storing a lot of additional data.
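As a thought experiment, the same min/max/mean could be derived at read time from the raw readings. A minimal sketch, assuming a raw_readings table with metadata_id, a numeric value column and a float epoch column recorded_ts (which is my invention, not the recorder’s actual states schema):

-- hypothetical read-time aggregation into 5-minute buckets
SELECT metadata_id,
       floor(recorded_ts / 300) * 300 AS bucket_start,
       avg(value) AS mean,
       min(value) AS min,
       max(value) AS max
FROM raw_readings
GROUP BY metadata_id, bucket_start
ORDER BY bucket_start;

Whether that is fast enough without materializing the result is exactly the performance question above.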
Lastly, I’ve been playing with TimescaleDB (a PostgreSQL extension) lately, and they are tackling the timeseries question for exactly this purpose. Timescale itself claims it’s faster and more efficient than InfluxDB, but in the end Timescale, Influx, etc. are all similar solutions to the timeseries problem. Also note that TimescaleDB should be fully compatible with the current implementation, as it uses standard SQL queries. One neat trick of TimescaleDB is that it can do the aggregation automatically (continuous aggregates), which is currently done manually.
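To illustrate what that could look like (a sketch only; the raw_readings hypertable with a timestamptz recorded_at column is my assumption, not the recorder’s schema):

-- with a refresh policy added, TimescaleDB maintains this rollup automatically
CREATE MATERIALIZED VIEW stats_5min
WITH (timescaledb.continuous) AS
SELECT metadata_id,
       time_bucket(INTERVAL '5 minutes', recorded_at) AS bucket,
       avg(value) AS mean,
       min(value) AS min,
       max(value) AS max
FROM raw_readings
GROUP BY metadata_id, bucket;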
I do get that the recorder is stuck with legacy constraints, existing databases etc., where migration is of course not trivial. But I also know there’s at least some demand for more serious data capture of HA sensors. The aggregation HA does on sensor data helps a bit with the storage/performance issues, but in the end it’s a whole lot of duplicate data we’re storing for no benefit/advantage.