Database performance: split data into multiple tables

I request that the sensor data be split into multiple tables, for performance and multi-threading reasons.

My database is 30 GB, but it's ALL in a single table, resulting in bad performance.
All supported databases support 2+ billion tables per database.
There is no reason why each device or each sensor can't have its own table.
Even having a table per month would be a better solution.

Where did you get that having multiple tables will enhance “performance and multi-threading”?

Maybe you should be looking into why your database is 30 GB. From what I've seen most people, including me, have a 1-2 GB database with purge days set to 7. And I have 1,002 entities. Perhaps you have some text-based sensors that change state frequently for which you don't really need the history? You could exclude these from the recorder.
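For example, excluding chatty entities is done in the recorder config (a sketch; the entity and glob names here are made up, substitute your own noisy sensors):

```yaml
recorder:
  purge_keep_days: 7
  exclude:
    domains:
      - media_player        # players update position every few seconds
    entities:
      - sensor.date_time    # changes every minute, rarely worth keeping
    entity_globs:
      - sensor.*_linkquality
```

After adding excludes, a purge (or just waiting for the nightly purge with `repack`) shrinks the database over time.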

Having said that, my states table is 560MB which should stay fairly stable, but my statistics table is 650MB, which will grow forever.


Not many people keep history over 1 month, so there would be no (or minimal) benefit from such a split… If you need more history to be retained, perhaps a solution like InfluxDB would be a better one.
I have 2,400 entities right now and I keep 30 days of history in my DB. With very selective configuration of what I want to keep in the DB (basically only the entities that I use in some charts in the UI, plus energy sensors), I have the DB size limited to ~3.5 GB.
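If you do go the InfluxDB route, HA can mirror states there alongside the recorder via the `influxdb` integration. A minimal v1-style sketch (host, database, and entity names are placeholders):

```yaml
influxdb:
  host: 192.168.1.10
  port: 8086
  database: home_assistant
  include:
    entities:
      - sensor.outdoor_temperature
      - sensor.energy_consumption
```

That way the recorder stays small and short-lived while InfluxDB keeps the multi-year history.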

There have been some great improvements to the Recorder database recently, and hopefully it will continue to get better. So I think this is a good conversation to have.

My recommendation would be to split it into (at least) two databases. The statistics tables used for permanent, long-term data should be separated from the rest of the data which gets routinely purged. Archival and real-time data don’t belong in the same database.

At an even finer level of detail, each entity should have its own retention period. Some data I want to keep for a day, some for a week, some forever. The current keep_days setting is a one-size-fits-all solution, which doesn’t really fit well with all the various kinds of data HA generates.


Totally agree on that. One (the “day-to-day” one) would be disposable, while the other (the statistics) has to be kept at all costs (slight exaggeration :wink: )

The larger the table, the worse the performance. Multi-threading allows multiple tables to be accessed at the same time. Also, not every query needs all the sensor data; splitting lets you query only the data you need at the time.
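For what it's worth, the "table per month" idea can be had in MySQL/MariaDB without the application knowing about it, via native partitioning. A sketch on a hypothetical table (this would NOT apply to HA's actual `states` table as-is, since MySQL requires the partition key to be part of every unique key):

```sql
-- Hypothetical readings table partitioned by month; the optimizer
-- prunes partitions, so a query for one month touches one partition.
CREATE TABLE sensor_readings (
    reading_id  BIGINT NOT NULL AUTO_INCREMENT,
    entity_id   VARCHAR(255) NOT NULL,
    state       VARCHAR(255),
    recorded_at DATETIME NOT NULL,
    PRIMARY KEY (reading_id, recorded_at)
)
PARTITION BY RANGE (TO_DAYS(recorded_at)) (
    PARTITION p2023_01 VALUES LESS THAN (TO_DAYS('2023-02-01')),
    PARTITION p2023_02 VALUES LESS THAN (TO_DAYS('2023-03-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);
```

Dropping an old month is then a cheap `ALTER TABLE … DROP PARTITION` instead of a huge `DELETE`.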

I keep all my data.
I want years of data to mine and compare, even if I use third-party tools to do the analysis.

I want to have years' worth of data so I can compare one year to the next. 30 days is nothing; I can't compare winter data to winter data and summer to summer.
When I change my windows, siding, or roof I want to compare the heating and/or air conditioning data from before and after.

I still end up with 30 GB+ of data, and that is only about 1 year.

I want to do year-by-year comparison of my data, so 7 days isn't going to cut it.
I replaced my roof; how did that affect my energy efficiency?
Unless I can compare the previous 365 days against the following 365 days, I won't have a good idea of the changes.
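The before/after comparison itself is trivial once the daily numbers are retained somewhere; a minimal Python sketch with made-up daily boiler runtime figures (real data would come from exported logs or long-term statistics):

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical data: daily boiler runtime in hours, keyed by date,
# covering 365 days before and 365 days after the roof replacement.
roof_replaced = date(2023, 6, 1)
runtime = {roof_replaced + timedelta(days=d - 365): 5.0 if d < 365 else 4.0
           for d in range(730)}

before = [h for day, h in runtime.items() if day < roof_replaced]
after = [h for day, h in runtime.items() if day >= roof_replaced]

print(f"mean before: {mean(before):.1f} h/day, after: {mean(after):.1f} h/day")
# → mean before: 5.0 h/day, after: 4.0 h/day
```

The hard part is keeping those 730 daily values around, which is exactly what the current purge model works against.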

I like the idea of keeping it all, forever. Storage is cheap. No argument there.

But that’s a very different requirement from a real-time database being used to collect, analyze and display the data in the HA UI. We have (at least) two different sets of requirements here. A one-size-fits-all solution will always be less than ideal.

Think of an airplane’s flight data recorder. Thousands of datapoints are tracked, but only for something like 30 minutes. There are lots of other data about that airplane which are extensively documented and tracked - how many landings each tire has performed, where each part came from and who was on shift at the factory when it was made, any problems encountered during the flight, the flight crew’s rest cycles between flights, it’s all retained. But not in the same database.

Without really realizing it, I’ve already developed my own workaround. There are data I want to be able to mine later, in case I get new insulation or an HVAC system, as in the example above. Those data never even make it to the database.

I log each on/off cycle of each zone in my HVAC system, as well as daily summary data like how long each zone was calling for heat or cooling, and the total daily runtime of my boiler’s burner. These are appended in real time to a number of text files. These text files are renamed monthly. My routine backups pick them up and archive them on my NAS. I can pull them into any software to analyze them any way I want.
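In HA terms this kind of append-to-file logging can be done with a templated `shell_command` plus automations. A rough sketch (entity names, paths, and the helper name `log_zone_cycle` are made up for illustration):

```yaml
shell_command:
  # Appends a CSV row; the file name rolls over monthly by itself.
  log_zone_cycle: >
    echo "{{ now().isoformat() }},{{ zone }},{{ action }}"
    >> /config/hvac_{{ now().strftime('%Y-%m') }}.csv

automation:
  - alias: Log zone 1 heat calls
    trigger:
      - platform: state
        entity_id: climate.zone1
    action:
      - service: shell_command.log_zone_cycle
        data:
          zone: zone1
          action: "{{ trigger.to_state.state }}"
```

Plain CSV files are easy to back up and open in any analysis tool later.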

This leaves the HA database for short-term and real-time data only. Many of my entities I don’t care about at all, and I exclude them from Recorder. The rest are only kept for a few days. I can delete the database without much chagrin.

You should use, e.g., InfluxDB for that. Time-series databases like that are meant for this.
An added bonus is that you can use proper tools to get proper statistics/graphs, e.g. Grafana.

The HA database schema was never meant for your use case, especially if you want to keep the detailed data rather than use the aggregated statistics, which are never purged.

All my data is still in a single table, and if it gets corrupted I lose everything.

You may be confusing terms here. My HA database has 13 tables. For what it’s meant to do, it’s a pretty good database design. As I said, it would make sense to me to separate out the different types of data into different databases, but the data are already split between multiple tables.

What I mean is this.
99% of my data resides in

-rw-rw---- 1 mysql mysql 38G Jun 7 16:30 states.ibd
-rw-rw---- 1 mysql mysql 1.7G Jun 6 22:00 statistics_short_term.ibd

Every other table is measured in KB, except one that is 12 MB.
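For anyone without shell access to the data directory, per-table sizes can also be pulled from `information_schema` (replace `homeassistant` with whatever your HA database is actually named):

```sql
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'homeassistant'
ORDER BY size_mb DESC;
```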

Having 30 GB of data in a single table seems to slow Home Assistant way down.

A database backup takes approx. 2 hours, and then causes further errors: the database is locked, the buffer fills up, and it can't log any more data until the backup is complete.
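Side note on the locking: for InnoDB tables, `mysqldump` can take a consistent snapshot without holding table locks for the whole dump. A command sketch (user and database name are placeholders):

```shell
# --single-transaction: consistent InnoDB snapshot, no long table locks
# --quick: stream rows instead of buffering whole tables in memory
mysqldump --single-transaction --quick \
  -u homeassistant -p homeassistant | gzip > ha_backup_$(date +%F).sql.gz
```

This won't make a 30 GB dump fast, but it should let the recorder keep writing while the backup runs.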

OK, I see.

Yeah, 40 GB is not good. My database right now is 12 MB. I started excluding entities and reducing the keep_days parameter when it first hit 3 GB.

All I can say is that I don’t believe the database is designed for long-term archival storage, nor would I want it to be. It’s a real-time system. The archival stuff belongs elsewhere. That’s just one guy’s opinion. Take it for what it’s worth.

There’s a massive amount of database information in the following thread, including queries you can run to show you which entities are hitting the database the heaviest. I’d highly recommend reading through it and applying some of the info to your setup to reduce your database size. There have been a lot of database changes since that thread started, so maybe start towards the bottom and work your way backwards to make sure you’re looking at recent/relevant information.


Hi, I recently added some links at the top post pointing to the most relevant recent comments with newer queries. This way, people shouldn’t need to read the entire thread nor start from the bottom. It’s very likely I missed some, feel free to suggest what other replies I should also link from the top post.


I was testing on an external database, and increasing these values to 2 GB made a big difference for performance.

innodb_buffer_pool_size=2048M
innodb_log_file_size=2048M

How do I permanently increase these values, since Docker seems to wipe them out on restart?
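One common approach (a sketch, assuming the official `mariadb` image; file names are arbitrary): mount a config file from the host, so the settings survive container recreation, and put the two `innodb_*` lines under a `[mysqld]` header in that file.

```yaml
# docker-compose sketch: ./my-custom.cnf on the host contains
#   [mysqld]
#   innodb_buffer_pool_size=2048M
#   innodb_log_file_size=2048M
services:
  mariadb:
    image: mariadb:10.11
    volumes:
      - ./my-custom.cnf:/etc/mysql/conf.d/custom.cnf:ro
      - ./data:/var/lib/mysql
```

Anything written only inside the container filesystem is lost when the container is recreated, which is presumably what you're seeing.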