Running with Very Large Databases

I hasten to add that I’m not talking about 10 GB. I’m not even talking about 50 GB. I’m currently sitting in the vicinity of 80 GB.

(And I should say that I’ve already pruned heavy contributors and too-frequent updaters from the entities I’m storing. I think it’s just because I have close to 3,000 entities. Of those, the biggest contributor takes up around 3.5% of the space and only six of them take more than 1%, so I think the gains from adjusting my recorder settings are pretty much used up.)

I’m currently using a remote MySQL 8.2 container to host the recorder database (I have HA running on my cluster with its config directory mounted over NFS from the NAS, so SQLite isn’t an option, and avoiding tying the DB to any given cluster node is preferable).
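For anyone wanting to replicate this setup, pointing the recorder at an external MySQL server is a one-key change in `configuration.yaml`. A minimal sketch (hostname, database name, and credentials below are placeholders, not my real values):

```yaml
recorder:
  # External MySQL/MariaDB instead of the default SQLite file.
  # utf8mb4 is needed for full Unicode support in state attributes.
  db_url: mysql://hauser:hapassword@db.example.local/homeassistant?charset=utf8mb4
```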

But this is starting to creak. While HA performs well for automations and executing actions, fetching data to display on the dashboard (particularly history graphs and other data-over-time displays) has become very slow indeed.

I’m looking for advice on how to optimize this. Would Postgres be better at handling a database of this size, or this type of read? Is there anything else I’ve overlooked that would improve performance in this sort of scenario? Anyone else out there with a really big database and advice to offer?

All thoughts much appreciated!

You wouldn’t want my advice, but I would delete it. I delete mine when it gets to a gig, roughly once a year or so. I would rather delete only what’s over about three months old, but just tossing the lot is the next best thing.


You’re stretching HA and your database beyond their capabilities. It’s one thing to keep a large database to do your own queries; it’s another to expect HA to deal with a ginormous database, when it isn’t clear that either component was meant to work with the other for that purpose.

Keep your database of archived data if you want; I do that. But keep it separate from HA’s production database.


+1 on what both @Sir_Goodenough and @d921 have said. 80GB? That’s intense! I would also take a hard look at what information you actually want in your DB and adjust your recorder configuration accordingly. For example, I’ve got plenty of devices that have signal strength entities (WiFi and Zigbee). I like having the entities for potential future troubleshooting, but why would I care that my bathroom light was at -66 dBm two weeks ago? Nah… That information is filtered out of my recorder.


I have 3,600+ entities in my system and ~10 GB of MariaDB holding my whole history for 30 days, so something is wrong with your setup.
What is your data retention; after how many days do you purge the database?
You mentioned nothing about exclusions and inclusions of entities in the recorder. I excluded everything from being recorded by default and only explicitly added those entities for which I want to display some charts (and there are plenty of them :slight_smile: ). Otherwise you do not need history, only current states. An additional reason for inclusion is troubleshooting or debugging an entity’s history, but that is temporary only.
Finally, after fine-tuning the database you do not need to delete it; it will eventually shrink over time as unused history is purged.


I’ll probably end up doing that by default if I switch DB servers, so… :slight_smile:

(Shame about the [relatively small] long-term-statistics data. I wonder if there’s any way to fish that out of the old database…)

I should probably have been clear up front that, well, yes, I know. :grin:

By what information we’ve got from the HA team and elsewhere, my installation could be fairly described as a heckin’ chonker along most metrics. (And the forbidden k8s install, too!) So I haven’t got here by accident, as it were, but because I’m the kind of home-automation enthusiast who enjoys pushing the limits, and am looking to share ideas with my fellow limit-pushers.

(I do appreciate your response; just wanted to clarify the angle I’m coming from.)

I do that, too - I rake off data points I want for long-term analysis into a separate InfluxDB. The recorder database itself shouldn’t have anything older than a fortnight.
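For context, the rake-off is done with HA’s InfluxDB integration, which mirrors selected states into InfluxDB as they change. A minimal sketch (hostname, database name, and entity IDs are illustrative placeholders, and I’m assuming the v1 API here; the integration also supports InfluxDB 2.x with different keys):

```yaml
influxdb:
  host: influxdb.example.local
  database: home_assistant
  # Mirror only the entities worth keeping long term,
  # so the InfluxDB side stays lean too.
  include:
    entities:
      - sensor.outdoor_temperature
      - sensor.total_energy_consumption
```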

My recorder database purge_keep_days is set to 14 days (a little longer than the default, but not excessively so, I would have thought), and I have an automation set up to purge and repack the database once per week to keep its size down.
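The weekly purge-and-repack automation is straightforward: the `recorder.purge` service with `repack: true` deletes expired history and then reclaims the freed space on disk. A sketch of what mine looks like (the schedule is arbitrary; pick a quiet time, since a repack can lock tables for a while on a big database):

```yaml
automation:
  - alias: "Weekly recorder purge and repack"
    trigger:
      - platform: time
        at: "03:00:00"
    condition:
      - condition: time
        weekday:
          - sun
    action:
      - service: recorder.purge
        data:
          # repack rewrites the database files to reclaim
          # the space freed by the purge
          repack: true
```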

I did, I think? I’m still using default-include as so far I’ve wanted to keep the drill-down-into-history option available, but I’ve always excluded unnecessary entities (like time and date entities, and randomness entities, etc.) from the recorder. I’ve also excluded entities which either have a very large state/attributes and/or which are updated excessively frequently, and have checked this with database queries; I’ve got no outliers left that are taking up unusually large amounts of database space.
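For anyone wanting to run the same outlier check: on recent HA schema versions the `states` table references entity IDs via `states_meta`, so a row count per entity looks roughly like this (older schemas stored `entity_id` directly in `states`, so adjust the join if yours predates the change):

```sql
-- Top 20 entities by number of recorded state rows;
-- heavy hitters here are candidates for recorder exclusion.
SELECT sm.entity_id, COUNT(*) AS row_count
FROM states s
JOIN states_meta sm ON s.metadata_id = sm.metadata_id
GROUP BY sm.entity_id
ORDER BY row_count DESC
LIMIT 20;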

It’s a good thought, though, and if I can’t database-tune my way out of this I may resort to excluding all the diagnostic entities except when needed.

(Sure would be nice if we could exclude all diagnostic entities in one line…)

Actually it is very easy (quoting the official recorder documentation). What is recorded:

  1. Only includes (mirekmal: only includes listed in configuration.yaml)
  • Entity listed in entities include: include
  • Otherwise, entity matches domain include: include
  • Otherwise, entity matches glob include: include
  • Otherwise: exclude

This is actually what I do: no excludes, includes only, and it works perfectly fine; only entities explicitly listed are stored in the database :slight_smile:
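In configuration terms, the include-only approach from the quoted rules boils down to listing an `include` block and nothing else; per those rules, anything not matched falls through to "exclude". A sketch (entity IDs and domains are placeholders for whatever you actually chart):

```yaml
recorder:
  # With only an include block present, everything
  # not listed here is excluded from recording.
  include:
    entities:
      - sensor.living_room_temperature
      - sensor.total_energy_consumption
    domains:
      - climate
```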

I run with a 26 GB database on MariaDB and it works fine. I have it on a fast disk and give it 4 GB of RAM, running on a middle-of-the-road AMD CPU. Memory and disk speed are the key levers. What are you running with?
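On the memory point: for MySQL/MariaDB the single biggest lever is usually `innodb_buffer_pool_size`, which controls how much of the working set stays cached in RAM. A sketch of the relevant `my.cnf` settings (the values are illustrative assumptions, not tuned recommendations; a common rule of thumb is ~70% of RAM on a host dedicated to the database):

```ini
[mysqld]
# Cache for InnoDB data and indexes; history-graph reads
# get dramatically faster when the hot data fits here.
innodb_buffer_pool_size = 4G
# Larger redo log smooths out write bursts from the recorder.
innodb_log_file_size = 512M
```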