How to keep your recorder database size under control

Ok noted, makes sense.

I have been doing manual deletes for now.
But after I did that the day before yesterday, something went wrong during the nightly recorder purge call with repack, and the whole history in HA vanished.
So a loss of 3 days, nothing serious, but strange nonetheless…

I’ve added sqlite to the “Terminal & SSH” add-on and made a script to run the queries:

apks:
  - sqlite
authorized_keys:
  # Rest of your terminal+ssh configuration
#!/bin/bash -xv
# Per-object size breakdown (tables and indexes) via SQLite's dbstat virtual table
sqlite3 -readonly home-assistant_v2.db <<EOSQL
SELECT
  SUM(pgsize) bytes,
  name
FROM dbstat
GROUP BY name
ORDER BY bytes DESC;
EOSQL


# Row count and event_data payload size per event_type
sqlite3 -readonly home-assistant_v2.db <<EOSQL
SELECT
  COUNT(*) as cnt,
  SUM(LENGTH(event_data)) as bytes,
  event_type
FROM events
GROUP BY event_type
ORDER BY cnt DESC;
EOSQL


# The entity with the most recorded state rows
sqlite3 -readonly home-assistant_v2.db <<EOSQL
SELECT entity_id, count(*) as NUM FROM states GROUP BY entity_id ORDER BY NUM DESC LIMIT 1;
EOSQL

Hi all-
Coming back to the original post

Viewing states usage

Had a nice query to see which entities were writing the largest state changes.

…well, now we need a new query to understand which entities are writing the most attributes into the new optimized table structure.

I believe this does it reasonably efficiently. (Note: this has only been tested on MariaDB.)

select
  entity_id
, b.cnt
, round(b.cnt * 100 / (select count(*) from states), 2) as "cnt_pct"
, b.bytes
, round(b.bytes * 100 / (select sum(length(shared_attrs)) from state_attributes), 2) as "bytes_pct"
from (
  select entity_id, count(*) as "cnt", sum(length(shared_attrs)) as "bytes"
  from ( select distinct s.entity_id, sa.attributes_id, sa.shared_attrs
         from states s join state_attributes sa on sa.attributes_id = s.attributes_id
       ) a
  group by entity_id
) b
order by bytes desc
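
For anyone on the default SQLite database, roughly the same query can be run through the sqlite3 heredoc script shown earlier in the thread. This is only a sketch, assuming the post-2022.4 schema with the state_attributes table; the 100.0 forces floating-point division, which SQLite would otherwise truncate:

sqlite3 -readonly home-assistant_v2.db <<EOSQL
SELECT
  entity_id,
  b.cnt,
  ROUND(b.cnt * 100.0 / (SELECT COUNT(*) FROM states), 2) AS cnt_pct,
  b.bytes,
  ROUND(b.bytes * 100.0 / (SELECT SUM(LENGTH(shared_attrs)) FROM state_attributes), 2) AS bytes_pct
FROM (
  SELECT entity_id, COUNT(*) AS cnt, SUM(LENGTH(shared_attrs)) AS bytes
  FROM ( SELECT DISTINCT s.entity_id, sa.attributes_id, sa.shared_attrs
         FROM states s JOIN state_attributes sa ON sa.attributes_id = s.attributes_id
       ) a
  GROUP BY entity_id
) b
ORDER BY bytes DESC;
EOSQL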

We did quite a bit of work to reduce this a bit more for 2022.5.

If there is a particular integration that generates the most state_attributes data, we might be able to improve it as well.

Awesome - it’ll be cool to see the next iteration.

Integrations like UPnP/IGD that talk to wireless routers produce essentially random numbers. There’s not a whole lot of optimizations to be had there. So, queries like above will still be useful to identify the problem children to exclude from the Recorder entirely.
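
Once a query like that identifies a chatty entity, it can be dropped from the Recorder with an exclude along these lines (just a sketch; the entity IDs below are placeholders for whatever your own queries turn up):

recorder:
  exclude:
    entities:
      - sensor.router_kib_s_received   # placeholder: a noisy UPnP/IGD throughput sensor
    entity_globs:
      - sensor.router_*_packets*       # placeholder glob for similar counters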

Hello everyone,

I was checking my database, now that I have collected a week’s worth of data again, and noticed that the events table is needlessly flooded with state_changed entries.

I would like to disable tracking of state_changed in the events table, but it seems that when I disable this in the recorder integration, the actual state change is no longer tracked either (which kind of makes sense).

Is it possible to only disable state_changed tracking in the events table? The state is included in the states table anyway, and this causes the database to have roughly twice the size and twice the writes it needs (two tables with more or less the same information).

recorder:
  db_url: !secret mariadb_homeassistant_url
  commit_interval: 60
  auto_purge: false
  include:
    entity_globs:
      - sensor.*_energy
      - sensor.*_power
      - sensor.*_today
      - sensor.eq_3_*
      - binary_sensor.garage*
      - sensor.home_assistant_v2_db_size_bytes
  exclude:
    entity_globs:
      - sensor.*_energy_total*
      - sensor.*_reactivepower
      - sensor.*_apparentpower
      - sensor.*_energy_voltage
      - sensor.*_energy_current
      - sensor.*_yesterday
    event_types:
      - state_changed
      - lovelace_updated
      - hacs*

This code now prevents any states from being written to my DB. Which obviously was not my goal :smiley:

If you don’t record the event, the logbook will no longer work. There will also be no way to historically tell which user caused an event, or which automation triggered an event, since all the context IDs will be missing.

The system isn’t designed to work in such a degraded state where part of the state_changed event is missing, so it’s not possible to do what you are asking.

It is designed to give the user control. That is why you can set everything to fatal (which I have).
It is also irrelevant who caused an event.

If, for some reason, you feel it is necessary to force the events table, then at least make it optional to track the needless state_changed entries. They carry no useful information; in basically all cases they are caused by the data source and not by a user.

It was announced that the database was optimized, yet ~50% of the entries (and hence writes) are useless writes in the events table (every state change gets one write in states and one in events, so in my case this is basically 50%).


@bdraco: I was told I sound combative. That is not my intention and I apologize if that is the case.
I am just trying to make my argument on the matter.

So, for me, the more control a user has, the better, and the more it is in line with the (very good) overall approach of improving the user’s ability to control their system.

As we are all trying to optimize our databases (for size, for write cycles, or any other reason), I feel it would be helpful if
Option 1: it were possible to deactivate writing to certain tables, or
Option 2: certain features could be deactivated.

Meaning, it would be awesome if we could e.g. disable state_changed in the events table, because it holds no raw state data; it is only a logbook of events. Many people do not need this, and it is responsible for ~50% of the writes and file size (if statistics are disabled) or ~40% (if statistics are enabled).

So, in my opinion, the impact on write cycles and database size would be very, very significant.
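
For anyone who wants to verify that split on their own MariaDB instance, the per-table footprint can be read from information_schema (a sketch; 'homeassistant' is a placeholder for whatever schema your db_url points at):

SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'homeassistant'   -- placeholder schema name
ORDER BY (data_length + index_length) DESC;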


There are still optimizations that can be done to reduce the size of the events table that are not major breaking changes. I’m working on those now:
- De-duplicate event data into a new event_data table by bdraco · Pull Request #71135 · home-assistant/core · GitHub
- Use ULID short format for context ids by bdraco · Pull Request #71119 · home-assistant/core · GitHub
- Add json decode caching to logbook by bdraco · Pull Request #71080 · home-assistant/core · GitHub

Currently there are no plans to get rid of the state_changed events in the events table, as removing them would break the logbook and the associated automation debugging. It’s likely possible to remove them in the future with additional major refactoring (it’s a lot of work), but we are focusing on optimizations we can do without major breaking changes first.


Do you think it would be possible to add a different exclude filter in ‘recorder’ that would allow filtering these out? Maybe that is the simpler approach?

At the moment the only way to prevent entries in the events table is by excluding ‘event_type: state_changed’, but this also excludes the states from the states table (i.e. the actually wanted data).

Maybe it would no longer be a breaking change if excludes could be combined with a target table?

exclude:
  - event_type: state_changed @ events

Something like that?

That would break purging, since the purge process could no longer find the data in the database: it looks up the state from the linked event_id, so your database would grow in size forever.

That would not be good, I agree :smiley:

So purging is not based on the time stored in each table, but only on the time stored in the events table? Because the states table also has time information, if I remember correctly.

Correct

It does have it, and the time values are the same for last_updated and time_fired for data recorded in the last few months (some users will still have older data that is not in sync). They were not always the same, but that was changed to prepare for optimizing this in the future; we just aren’t ready to do that yet.
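
If you want to check whether your own older data is still out of sync, a quick count will tell you (just a sketch, assuming the 2022-era schema where states still links to events via event_id):

SELECT COUNT(*) AS out_of_sync
FROM states s
JOIN events e ON e.event_id = s.event_id
WHERE s.last_updated != e.time_fired;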

Interesting, thank you very much.

Self-sufficient tables were of course a great improvement then.
When was this introduced?
And it would “only” affect data that was stored ages ago, as purges of new data could be based on the times per table?
So no infinite growth; only data from before this change would be stored permanently.

Might it be an option to perform an automatic transfer of the time information from events to states for those old setups?
Since the links are available, an automatic update of the states table could be performed and afterwards the link could be broken. Then even old data would be complete and each table would be self-sufficient.
From memory, it would be a join/merge of the two tables into a new one, then a delete and rename (a less direct approach to prevent data loss).

At this point we are no longer discussing the original topic and have wandered into feature-request territory.

I suggest waiting for future optimizations to come down the pipe, as what you are discussing here would likely mean adding more features that we would have to account for, which would delay work on future optimizations.


Okay, looking forward to future improvements then :slight_smile:
Fingers crossed you will find an easy solution. Should significantly reduce write cycles, judging by my db :smiley:

I’m looking forward to all the changes! It’s great to see the Recorder database getting some developer love!

Speaking of FRs, there have been a number of them requesting variations on the theme of making it possible to retain some events and state changes longer than others, rather than having just one retention period for everything (or excluding the entity altogether). Here’s the one I submitted, but it’s a subject that comes up a lot, with a number of (good) ideas on how to implement it.

Is something along these lines in consideration?

Thank you!!

Don’t take it personally, mate; most of the mods here seem to say that any time you offer any feedback at all, no matter how much you kiss up to them. I just praised the 2022.5 changes on the Discord and mentioned one thing I considered unnecessary, then got called “massively negative”.

I think you’ve got some very good points personally and I have noticed the sqlite database starting to grow out of control lately. @bdraco thanks for working on it! Much appreciated.
