Recorder stores data inefficiently, timeseries logging probably better?

Hey all,

I’ve been looking into the recorder for a while, and recently migrated myself to postgresql/timescaledb. Learning more about timeseries, I noticed that the recorder (using postgresql; not sure if sqlite would be any different) currently stores interval-based data, i.e. it stores the same value/data multiple times. For example:

homeassistant=# select * from statistics_short_term where metadata_id=22 limit 10;
  id   | created |     created_ts     | metadata_id | start |  start_ts  | mean  |  min  |  max  | last_reset | last_reset_ts | state | sum 
-------+---------+--------------------+-------------+-------+------------+-------+-------+-------+------------+---------------+-------+-----
 31563 |         | 1717294810.4162643 |          22 |       | 1717294500 | 83.36 | 83.36 | 83.36 |            |               |       |    
 31570 |         | 1717295110.4101014 |          22 |       | 1717294800 | 83.36 | 83.36 | 83.36 |            |               |       |    
 31577 |         | 1717295410.4212794 |          22 |       | 1717295100 | 83.36 | 83.36 | 83.36 |            |               |       |    
 31584 |         | 1717295710.4160612 |          22 |       | 1717295400 | 83.36 | 83.36 | 83.36 |            |               |       |    
 31591 |         | 1717296010.4175048 |          22 |       | 1717295700 | 83.36 | 83.36 | 83.36 |            |               |       |    
 31598 |         |  1717296310.416331 |          22 |       | 1717296000 | 83.36 | 83.36 | 83.36 |            |               |       |    
 31605 |         | 1717296610.4163961 |          22 |       | 1717296300 | 83.36 | 83.36 | 83.36 |            |               |       |    
 31612 |         |  1717296910.416549 |          22 |       | 1717296600 | 83.36 | 83.36 | 83.36 |            |               |       |    
 31619 |         | 1717297210.4168398 |          22 |       | 1717296900 | 83.36 | 83.36 | 83.36 |            |               |       |    
 31626 |         | 1717297510.4161253 |          22 |       | 1717297200 | 83.36 | 83.36 | 83.36 |            |               |       |   

First thing I noticed here is that we have the fields created and start, which are of type timestamp, yet they are not set. Instead, we have floats (created_ts and start_ts, and last_reset_ts is probably one as well) representing the unix timestamp with sub-second precision.

So first, why even have these columns at all? Why not use the columns actually intended for this purpose?

Secondly, I’m not sure why we are storing 2 timestamps here; I guess this is ‘start of query, end of query’? From a database storage efficiency point of view, this seems a bit wasteful. Granted, this table was intended for ‘short term storage’ (10 days by default). Still, we could do better, so why aren’t we? Something to discuss?

Thirdly, the data is ‘timeseries’ in itself. What I’ve learned about timeseries data is that you’d actually only want to store/commit if/when the data has changed, e.g. (from AI :p)

INSERT INTO mytable (col1, col2)
SELECT 'data1', 'data2'
WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE col1 = 'data1' AND col2 = 'data2');

So something along those lines should be possible.
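To make that a bit more concrete, here’s a rough sketch of what ‘only store on change’ could look like in plain PostgreSQL. The sensor_samples table is made up purely for illustration (it’s not the actual recorder schema):

-- Hypothetical table, for illustration only (not the recorder schema)
CREATE TABLE IF NOT EXISTS sensor_samples (
    metadata_id integer          NOT NULL,
    ts          timestamptz      NOT NULL,
    value       double precision
);

-- Insert a new sample only when it differs from the most recently
-- stored value for that sensor
INSERT INTO sensor_samples (metadata_id, ts, value)
SELECT 22, now(), 83.36
WHERE 83.36 IS DISTINCT FROM (
    SELECT value FROM sensor_samples
    WHERE metadata_id = 22
    ORDER BY ts DESC
    LIMIT 1
);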

Fourthly, and this is just my lack of understanding, what is the meaning of ‘mean, min, max’? If I get a piece of sensor data, it has a single value, so when/how would the min/max/mean come into play? Could we not do this ‘smarter’? For many sensors, we’d only ever get a single value, no? Just to enlighten myself, which sensors produce multiple values?
A simple example:

 3373 |         | 1717383610.2993858 |          22 |       | 1717380000 |             83.36 | 83.36 | 83.36 |            |               |       |    
 3386 |         |  1717387210.297326 |          22 |       | 1717383600 |             83.36 | 83.36 | 83.36 |            |               |       |    
 3399 |         | 1717390810.3020093 |          22 |       | 1717387200 |             83.36 | 83.36 | 83.36 |            |               |       |    
 3412 |         |  1717394410.302218 |          22 |       | 1717390800 |             83.36 | 83.36 | 83.36 |            |               |       |    
 3426 |         | 1717398010.2976599 |          22 |       | 1717394400 |             83.36 | 83.36 | 83.36 |            |               |       |    
 3440 |         | 1717401610.3043294 |          22 |       | 1717398000 |             83.36 | 83.36 | 83.36 |            |               |       |    
 3454 |         | 1717405210.3032625 |          22 |       | 1717401600 | 83.41644472104998 | 83.36 | 83.54 |            |               |       |    
 3468 |         | 1717408810.2744696 |          22 |       | 1717405200 | 83.53999999999999 | 83.54 | 83.54 |            |               |       |    
 3482 |         | 1717412410.3030684 |          22 |       | 1717408800 | 83.53999999999999 | 83.54 | 83.54 |            |               |       |    
 3498 |         | 1717416014.1221318 |          22 |       | 1717412400 | 83.53999999999999 | 83.54 | 83.54 |            |               |       |    
 3530 |         | 1717419610.4555848 |          22 |       | 1717416000 | 83.53999999999999 | 83.54 | 83.54 |            |               |       |    
 3555 |         | 1717423210.4580786 |          22 |       | 1717419600 | 83.53999999999999 | 83.54 | 83.54 |            |               |       |    

(ignore the fact that the data is not properly rounded, I think that’s a HA bug)

Or is it the case that the recorder is doing something smart at the time of a transition (id 3454), where it stores the previous value (in this case as min, but it could just as well be the max) and the new value (in this case the max)? Is there a (huge?) performance benefit in doing so? Could this not be done in the SQL query itself that retrieves the data? A view/index that could accelerate this? I’m not opposed to this, it just seems like storing a lot of additional data.
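To illustrate what I mean by doing it in the query: if the raw samples were kept in a simple table, the 5-minute min/max/mean could in principle be computed at read time instead of being stored. A rough sketch in plain PostgreSQL, reusing the made-up sensor_samples table from above (timescaledb’s time_bucket() would make this nicer):

-- Compute 5-minute min/max/mean on the fly instead of storing them
SELECT metadata_id,
       to_timestamp(floor(extract(epoch FROM ts) / 300) * 300) AS bucket_start,
       min(value) AS min,
       max(value) AS max,
       avg(value) AS mean
FROM sensor_samples
WHERE metadata_id = 22
GROUP BY metadata_id, bucket_start
ORDER BY bucket_start;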

Lastly, I’ve been playing with timescaledb (a postgres extension) lately, and they are tackling exactly this timeseries question. Timescale itself claims it’s faster and more efficient than influxdb, but in the end timescale, influx, etc. are all similar solutions to the timeseries problem. Also note that timescaledb should be fully compatible with the current implementation, as it uses standard SQL queries. Another cool trick of timescaledb is that it can do aggregation automatically, which is now done manually.

I do get that the recorder is stuck with legacy reasons, existing databases etc., where migration is of course not trivial. But I also know there’s at least some demand for more serious data capture of HA sensors. The aggregation of HA sensors helps a bit with the storage/performance issues, but in the end, it’s a whole lot of duplicate data we’re storing for no benefit/advantage.


The HA database never was and never will be designed for statistical data.
The “long term statistics” tables you show are a fairly recent addition; before that, we did not have anything past the 10-day (by default) database purge.

The response has been and always will be: if you want solid long-term statistics, export the data to an actual timeseries db like influxdb.

So I was hoping to get a little bit more productive feedback :slight_smile:

First,

The response has been and always will be: if you want solid long-term statistics, export the data to an actual timeseries db like influxdb.

So why not have an influxdb backend for the recorder?

Anyway, if we want to keep all events and sensor data, I don’t think there’s an easy way to log everything there, right? And even if so, it’s a bit of a band-aid, isn’t it? We’d be storing everything thrice: statistics, statistics_short_term and influxdb. Plus all the management overhead etc., of course.

So is it an architectural decision to let the recorder never evolve/become better? Its use case is and always will be ‘it’s meh, but good enough, there’s no need to improve it’? If so, why was the statistical model even added, if the answer is ‘just use influxdb’?

Because Influxdb is a timeseries db. I don’t think it’s even capable of doing anything relational.

Not sure what your expectations were in creating this thread, but if it was that the devs would pop up and argue with you on the db design, that will just not happen, I’m afraid.

There is a dedicated GitHub for architectural discussions. You will have more chances to at least be read there.

Lol, the devs sent me here, because architectural discussions are supposed to be held on the forum :stuck_out_tongue: So what makes you say that devs don’t talk about architecture here? But, for reference purposes :stuck_out_tongue: Recorder stores data inefficiently, timeseries logging probably better? · home-assistant/architecture · Discussion #1100 · GitHub

So, what I’m ultimately after is using timescaledb as a timeseries database; by using the existing recorder/alchemy with timescaledb, that’s possible. Extending this to influxdb in the future, if possible, someone could …

As for the current database design, there’s some low-hanging fruit/wins before going to a real timeseries database. E.g. we should store deltas instead of samples; there’s no point in storing the same entry 100,000 times with just a different timestamp.

Likewise, why are we storing the timestamp as a float, when we have a column of type timestamp?

Sure, sure. I’m eager to read what they will write :joy::joy:

At least you had an embryo of a backstory.
Just keep in mind statistics are a pretty late afterthought.

I’m not even sure what made the devs eventually add them.
Maybe the years of complaining about the complete disinterest in using HA as a history db :wink:


Something you might want to consider as part of this discussion are the use cases. When I choose a DB technology, I use various factors. This might be time series data, but if it’s not frequently referenced as such, then you don’t gain (necessarily) by storing it in a time series DB. Time series DBs are usually fast on insert, so that is a good thing for HA.

Perhaps also state what you think is lacking in the existing solution (and yes, I know you are making an enquiry at this point in time). Are there particular performance issues? Do you want to improve specific ways data can be retrieved or otherwise processed?

You also mention things that are not specific to the DB tech chosen, such as the “duplicates” you noted. Better to keep the issues apart.

Lastly, remember that SQLite is the default DB, which makes it super easy for new users to get going. I’m pretty sure that is an important aspect here. Choosing another, less general (albeit more powerful), DB technology might create an unwanted barrier. This is probably why the third-party setup for an Influx DB would be preferred: It’s optional and for power users, because the existing setup strikes a good compromise.

These are just some brief thoughts.


Something you might want to consider as part of this discussion are the use cases. When I choose a DB technology, I use various factors. This might be time series data, but if it’s not frequently referenced as such, then you don’t gain (necessarily) by storing it in a time series DB. Time series DBs are usually fast on insert, so that is a good thing for HA.

Absolutely, those are your starting-point requirements: what do we expect to store here? You could argue that things have just naturally evolved and the needs have changed without us formally realizing this.

So what do we see right now? The database serves two purposes: one is configuration management (in parallel with/supplementing the yaml files), and the other is the recorder component.

I did put it in the topic, as the first word, that I would be questioning the recorder :wink: So let’s focus on that, as I agree that using SQLite even for configuration items is a no-brainer (if we use a database for this purpose). Heck, splitting these two apart even makes sense here.

So looking at the recorder, its main purpose seems to be recording (historical) data of sensors/events, right? Arguing that we only store ‘slowly changing data’, e.g. sun/moon times etc., quickly becomes moot. But I fully admit, I only know my use cases. So far, my home-assistant usage is very limited. I have my car (renault) connected, and get sensor data there. I also have a ‘slimmelezer’ connected, but I still only have a Ferraris meter ;p

Looking at the ‘energy monitoring’ use-case, which is a very prominent part of the default installation, the time-series argument becomes even stronger.

Timeseries databases are designed for this use-case, I would argue, so it’s not illogical to think in this direction.

Perhaps also state what you think is lacking in the existing solution (and yes, I know you are making an enquiry at this point in time). Are there particular performance issues? Do you want to improve specific ways data can be retrieved or otherwise processed?

So let’s focus for now on a single use-case, energy monitoring. This is an area HA seems to pull attention towards. A standard setup would store time-series (sensor) data of energy consumption in an SQLite database. (Yes, this is still time series data; even if we do not format it or use it as such, the nature of the data is time series.)

So what’s lacking here, firstly, is the fact that we cause a huge number of unnecessary writes. Again, let’s assume a raspberry pi setup with an SD card. Not unlikely (even if not recommended), but it could also be a hosted server with SSD storage. I think I don’t have to go into detail on why this is bad, but for those unaware: all these writes for each datapoint (even if they are somewhat cached in RAM first) could cause a (small) write, with huge impact on SD wear. On a server farm, this also gets amplified by a huge number of incoming writes, and there is a cost associated with that fact (this could financially impact nabu-casa).

Secondly, querying the data can become problematic if a lot of data is written, because reads also become inefficient (maybe that’s why we have min/max/avg in the table?), but the default setup purges data after 10 days, right? However, we now have a statistical data model where we store summary data of those 10 days, right? I haven’t dug into how this is done exactly, but I also saw that for now it doesn’t matter, because we use the same inefficient storage methodology. So after X years of writing a lot of aggregated data, we have a messy and slow database.

You also mention things that are not specific to the DB tech chosen, such as the “duplicates” you noted. Better to keep the issues apart.

Technically correct, but conceptually I don’t think it matters. The current database design decisions hint towards a solution.

But to make it clearer: I am actually not suggesting to use influxdb for the recorder; the database tech is too different. It could/should certainly be possible in the future, of course, to expand the recorder to also use influxdb.

I’m actually suggesting (based on my assumptions and lack of current knowledge, hence the initial questions I asked) that we first re-organize the database/tables to be time-series oriented. E.g., to dumb it down, a simple table with just 3 fields: timestamp, metadata_id, value. Secondly, and this is an important database design decision, evaluate whether it makes sense to keep all sensor data in a single table, or have a table per sensor, which would reduce the columns to just two: timestamp, value. The metadata_id is then moved into the table name. The choice here is whether the dominant query is based on timestamp or metadata_id.
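To make the ‘dumbed down’ version concrete, the two layouts could look something like this (all names are purely illustrative, not a proposal for actual table names):

-- Layout A: one shared table for all sensors
CREATE TABLE sensor_values (
    ts          timestamptz      NOT NULL,
    metadata_id integer          NOT NULL,
    value       double precision,
    PRIMARY KEY (metadata_id, ts)
);

-- Layout B: one table per sensor; the metadata_id moves into the table name
CREATE TABLE sensor_22_values (
    ts    timestamptz      NOT NULL PRIMARY KEY,
    value double precision
);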

Let me explain (this even goes a little too deep for now). If we want to query all sensors around timestamp X, having a single table lets us state ‘select * from statistics where timestamp between X-1 and X+1’; in that case, it’s best to have a single table. I can’t imagine this use case being very common, however, and imagine it more likely that we just want graphs/historical data on a single sensor (e.g. select * … where metadata_id = X). The other query is still possible then, but requires more complex joins. Due to my ignorance, I don’t even know if this ‘cross-data query’ is currently even possible with the recorder. I fully understand, btw, that the current database design is what it is due to reasons, which I was asking about in my start-post. It does of course influence which columns to put in there.
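As a sketch, the two query shapes against the shared table (layout A above) would roughly be:

-- Cross-sensor query: everything around a given point in time X
SELECT * FROM sensor_values
WHERE ts BETWEEN timestamptz '2024-06-02 00:00+00' - interval '1 minute'
             AND timestamptz '2024-06-02 00:00+00' + interval '1 minute';

-- Per-sensor history query: (probably) the dominant case, e.g. for graphs
SELECT ts, value
FROM sensor_values
WHERE metadata_id = 22
  AND ts >= now() - interval '7 days'
ORDER BY ts;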

Lastly, remember that SQLite is the default DB, which makes it super easy for new users to get going. I’m pretty sure that is an important aspect here. Choosing another, less general (albeit more powerful), DB technology might create an unwanted barrier. This is probably why the third-party setup for an Influx DB would be preferred: It’s optional and for power users, because the existing setup strikes a good compromise.

Yes/no/maybe. With that I mean: SQLite is certainly easier for newcomers. sqlite3 file.db and get going. Use a nice gui, no extra tooling required. But also, I’d expect newcomers to use the recorder as the interface to the database; it abstracts the database, doesn’t it? Regardless, as you rightfully pointed out, if we get the database format in better shape, the database tech is still irrelevant.

Now assuming our database is the 3 columns I mentioned above, with regards to database tech, I’d be proposing to add support for timescaledb. I also understand that this is something most are unaware of, so let me explain it a bit.

Timescaledb is an extension to postgresql. It works with pure ‘standard’ SQL commands. This is important, because it means that SQLAlchemy can still be used. So what makes timescaledb different? Actually nothing. You use your postgresql database as normal. E.g. you can run a timescaledb-enabled postgresql instance as normal (without any timescaledb advantages).

The addition to the recorder would be that, upon detecting/requesting it, the table would need to be converted. So what does that mean? Something like:

SELECT create_hypertable('conditions', by_range('timestamp'), migrate_data => true);

After that, the table still has the same columns, the queries are still the same, but now timescaledb can do its thing.

There’s a bit more with regards to creating indexes (if not existing), e.g. (and now I’m in way over my head, but as I understood it) to have an index on a view where you have multiple data types in the same table (like we do now). But also aggregation: Timescale can do aggregation ‘behind the scenes’ automatically. And a few other things (like averaging upon a new data entry).
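For reference, this ‘behind the scenes’ aggregation is what timescale calls continuous aggregates. A rough sketch for the illustrative sensor_values table from above, with the syntax as I understand it from the timescale docs (so treat the details as assumptions):

-- Assumes sensor_values has already been converted to a hypertable
CREATE MATERIALIZED VIEW sensor_values_5min
WITH (timescaledb.continuous) AS
SELECT metadata_id,
       time_bucket('5 minutes', ts) AS bucket,
       min(value) AS min,
       max(value) AS max,
       avg(value) AS mean
FROM sensor_values
GROUP BY metadata_id, bucket;

-- Have timescaledb keep the aggregate up to date in the background
SELECT add_continuous_aggregate_policy('sensor_values_5min',
    start_offset      => INTERVAL '1 hour',
    end_offset        => INTERVAL '5 minutes',
    schedule_interval => INTERVAL '5 minutes');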

But! These would be optimizations on top of the current SQLite solution! The SQLite solution shall, and will, function as normal. In the case of aggregation, though, it would need to be done manually for SQLite, Mariadb etc.

Just to avoid confusion: No configuration data goes into the db.
All the configuration is in the YAML files (old style) or in JSON in the .storage subdirectory.

The db serves 2 different purposes, though, indeed:

  • transient data: detailed state changes et al. Those are purged after 10 days, by default
  • statistical data: Long term statistics, aggregated on 5min and 1h durations.

I’m pretty sure I’ve already seen proposals to have those 2 usages split between 2 databases, which would make plenty of sense imo (separation of concerns).

There’s a little bit of metadata in the DB, but configuration actually goes to JSON-structured files in .storage (besides YAML, as you mentioned).

Since you can actually run HA without a DB (i.e. nothing going to the recorder), I’d just say that HA is really (mostly) only concerned with current states. So, from that perspective, I’d say the main goal of the DB is actually to persist current state values, and secondary, state history. Keep in mind that the recorder settings don’t apply to the statistics tables.

I mostly agree with this, except to say that energy monitoring and management is a specific subset of HA functionality – only for users that have an energy dashboard.

Just to make sure I follow you here: HA states are only written when they change, but statistics can have duplicated values, and you’re referring to the latter, given your original post and example data, right? I have high write rates on the sensors from my inverter, since I get an update every second for a multitude of sensors. In that case, the number of writes cannot be minimised, unless one does some filtering (in other words, the values actually change).

Not quite. States are purged (by default) after 10 days, but statistics are retained indefinitely (at this point in time). All these tables are in one logical DB, but it’s really two separate structures/schemas, with the only link the entity ID.

In its abstract form, I understand what you’re saying. Given HA is an automation platform (primarily), I think if you want to get traction on this, you’d need to show examples of existing functionality that can be improved or what useful new functionality can be added.

For example, I’ve often had a need to check (in a condition, as part of an automation) whether a sensor was in a particular state at a certain point in time. You can achieve this e.g. with a SQL sensor or using a variety of helpers, but I’d much rather this be a built-in feature.
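The query behind such a check would be something like the following sketch against the recorder’s states/states_meta tables (column names as far as I can tell from the current schema; double-check db_schema.py before relying on it):

-- Last known state of an entity at a given point in time
SELECT s.state, to_timestamp(s.last_updated_ts) AS last_updated
FROM states s
JOIN states_meta sm ON sm.metadata_id = s.metadata_id
WHERE sm.entity_id = 'sensor.example'
  AND s.last_updated_ts <= extract(epoch FROM timestamptz '2024-06-01 12:00+00')
ORDER BY s.last_updated_ts DESC
LIMIT 1;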

Thank you for pointing that out :slight_smile: That does help clear up the discussion and the function of the database!

As for splitting it into 2 databases, I’m not sure I agree. For the SQLite use case, sure, it makes some sense, but in the end you just want two tables, which we already have.

So then let’s look back at the current table. Since short-term and long-term follow the same data model, let’s (for now) speak of them as one.

Currently, the database looks like this:

 id            | integer                  |           | not null | generated by default as identity
 created       | timestamp with time zone |           |          | 
 created_ts    | double precision         |           |          | 
 metadata_id   | integer                  |           |          | 
 start         | timestamp with time zone |           |          | 
 start_ts      | double precision         |           |          | 
 mean          | double precision         |           |          | 
 min           | double precision         |           |          | 
 max           | double precision         |           |          | 
 last_reset    | timestamp with time zone |           |          | 
 last_reset_ts | double precision         |           |          | 
 state         | double precision         |           |          | 
 sum           | double precision         |           |          | 

My database has no data in any of the non-_ts columns. Why is that? The _ts columns, ironically, hold timestamps, but in float8 format. Why is that?

Secondly, we keep min/max/avg values for each data entry. For the aggregation table I can somewhat understand this, but for the short_term table this is something I don’t understand. Over what period is the min/max/avg? The last X entries? Some time interval? Why can’t this be calculated in real time? What’s the historical value here?

Schema changes over time. Old columns were migrated, but I’ve never seen them removed (potentially due to the risk involved if there’s a bug).

To really compare, create a clean HA instance. I cannot confirm, but I suspect you won’t see the old columns.

My opinion, and this thread is a perfect example, is that “transient” data might stay in a relational db, and sqlite is perfectly fit for the job unless you go multiuser for whatever reason, while historic data might go to some timeseries db.

So 2 db to have the opportunity to have 2 different engines / usages, really.

This is a new install :smiley: well, new as of 5(?) weeks ago :stuck_out_tongue: And going from timestamp to float sounds like a downgrade in terms of functionality :slight_smile:

Yeah, I’m not sure whether old columns got removed from the schema scripts. I haven’t done the comparison myself and I haven’t been able to find the create scripts. It must be somewhere in the core repo, but where?

True, it doesn’t help with inspecting data, but certain native types can be better for performance. Still, it’s fairly simple to convert it in a query if needed. Given this topic is mostly about optimisation, I’d think convenience is of a lesser concern.
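For example, something along these lines in PostgreSQL, using the columns from the statistics_short_term dump earlier in the thread:

-- Render the float epoch columns as proper timestamps at query time
SELECT id, metadata_id,
       to_timestamp(start_ts)   AS start,
       to_timestamp(created_ts) AS created,
       mean, min, max
FROM statistics_short_term
WHERE metadata_id = 22
ORDER BY start_ts DESC
LIMIT 10;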

I don’t disagree, and I thought of this as well; but look what’s inside the database. It mostly IS timeseries data already in there …

True, it doesn’t help with inspecting data, but certain native types can be better for performance. Still, it’s fairly simple to convert it in a query if needed. Given this topic is mostly about optimisation, I’d think convenience is of a lesser concern.

Ah, well, timescaledb can create a hypertable where one column needs to be a timestamp (preferred), an integer (which is just a timestamp), or a date (a shorter timestamp). A float isn’t supported. So my question comes back around: a timestamp is a 64-bit int, and ints are generally more efficient in terms of CPU and probably also in terms of SQLite. So why was it converted from a sensible format (timestamp) to a float? (Btw, a timestamp is just a 64-bit integer with a certain representation during ‘select’, I would think.) And it’s an SQL-required thing, anyway …

A Unix timestamp as an int goes down to the second, whereas a float can do milli- and microseconds. Datetime objects typically go down to at least the millisecond.

What is your efficiency concern? It’s not the PK.

PostgreSQL: Documentation: 16: 8.5. Date/Time Types states that a timestamp is a 64bit type down to the microsecond, dunno how SQLite handles this.

But this is an interesting point. SQLite doesn’t state anything on precision here Date And Time Functions so no idea if those are 64 bit timestamps or something else. They do mention ‘subsecond’ precision.

So now the question turns back around to the beginning: ‘what is the purpose of the recorder?’ If we want timeseries data with microsecond precision, then using a kludge like SQLite with floats, because the native types do not support proper 64-bit timestamps, is a bit odd. While you are correct that you could get microsecond, nanosecond or even minute precision with a float, the general problem with floats remains: they are not precise, and working with them is ‘harder’ (comparing floats is generally a big no-no, precisely because of that).
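As a trivial illustration of the comparison problem (PostgreSQL):

-- Returns false: the sum is not exactly 0.3 in IEEE double precision
SELECT 0.1::float8 + 0.2::float8 = 0.3::float8;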

I get the ‘starter’ argument, but then the argument also quickly becomes ‘but it’s not a serious solution’, and if it’s not a serious solution, then the whole mariadb/postgres ‘possibility’ also becomes questionable.

If on the other hand we want to say the recorder will be a serious solution for timeseries recording (without adding much complexity compared to what we have now), then we should look at my proposal, where we state ‘sqlite to get started, with sqlite’s subsecond resolution; if you want to store data longer and more seriously, you can use the same recorder, but have to use postgresql (or if possible mariadb, or, once added in the future, influxdb) as the recorder backend’.

The proposed ‘just use influxdb’ doesn’t integrate nicely with HA’s default way of working though, so it’s a subpar solution. Also, those who would prefer a different database than influxdb also ‘suffer’ …

Btw: core/homeassistant/components/recorder/db_schema.py at dev · home-assistant/core (github.com)

For example:

    __tablename__ = TABLE_EVENTS
    event_id: Mapped[int] = mapped_column(Integer, Identity(), primary_key=True)
    event_type: Mapped[str | None] = mapped_column(UNUSED_LEGACY_COLUMN)
    ...