Speed-up Home Assistant by reducing the database size

Would it be possible to add a configuration item to reduce the maximum amount of data generated per minute (or hour)? My Home Assistant database is flooded, and there seems to be no good way to prevent that (the recorder purging doesn’t seem to do enough).

I think this is better handled as a configuration on the device side, so that it reports less frequently. To reduce the DB size, I only include entities for which I really want a history; e.g. I don’t need a history of the time sensor or the sun state.


I did so… but it seems that my energy readings (e.g. sensor.power_consumption) produce data every millisecond or so… it would be great if I could adjust that to seconds or minutes.

Can you not change the reporting frequency of the energy reading device?

I’m in contact with Nigel Dokter from dsmr_parser for that purpose. Currently that’s not possible, but it might be a good solution indeed.

However, I think Home Assistant transfers the data to the database… so maybe it should be implemented in Home Assistant…?

The only way to reduce the database size is to change the way Home Assistant stores data.

Currently, data is stored as JSON strings containing the entity, state, attributes, etc. These can be several hundred bytes each and are just stored as text blobs in a single field.

This makes it extremely inefficient to store data as database native types are completely ignored, destroying all optimization the database engine may do. It also makes queries thousands of times slower than they should be because it causes a lot of string parsing during queries, limits what can be done in SQL, and pushes a lot of processing to the client.

Sort of like treating a relational database as a flat log file.

Entities should instead link to another table holding attributes, states should link to the entity table, all attributes should be stored as their native data types, and indexes should be set up for all common queries.


As always, the solution was simple: just add "precision: 1" to the dsmr platform in configuration.yaml. That will reduce the data flooding caused by DSMR:

sensor:
  - platform: dsmr
    port: /dev/ttyUSB0
    dsmr_version: 5
    precision: 1

I am looking at this topic and think I have a couple of ways it could be implemented, see also here:

I tend to agree that the requirement for recording frequency should be defined at a device level.

Some sensors will have almost no need for recording, some will only need infrequent records, etc. Home Assistant by default records everything.

Ditching resolution just to decrease the logging rate seems the wrong solution.

There is the Filter sensor; I use its throttle option to make my incoming data more even. I believe you can do a similar thing even if your devices do not support any configuration.
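For example, something along these lines should pass on at most one value per minute from the energy sensor mentioned earlier in this thread (the name is just a placeholder):

sensor:
  - platform: filter
    name: power_consumption_filtered
    entity_id: sensor.power_consumption
    filters:
      - filter: time_throttle
        window_size: "00:01:00"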

@AhmadK Thanks, I was not aware of the filter. This addresses most of the issue, but not 100%.

A filter with throttle or time_throttle will store less in the database, but it will also have a slower update rate on screen, and there would be automation lag.

For example, if I want to store wind speed sensor data only once per hour, but also had an automation triggered by wind speed above a threshold, there could be up to 1h lag on the trigger, and gusts might not trigger it at all.

I was thinking of a way that the internal HA state could be ‘live’ unfiltered, but the recorded data would be throttled. So more like a time_throttle on the DB output rather than the sensor input.

You’ll have two sensors, raw and filtered. Store the filtered one (include it in the recorder) and use the raw one in automations (and exclude it from the recorder). Rough idea.
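Roughly like this (entity name follows the earlier posts; the filtered copy stays recorded by default, only the raw one needs excluding):

recorder:
  exclude:
    entities:
      - sensor.power_consumption   # raw sensor, used by automations but not recorded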


Yes, this works. But it’s messy and manual and creates several new problems:

  • in the UI, if I click on the ‘raw’ sensor, I used to get a history graph; now I get nothing.
  • if I add two entities to the UI (one raw, one filtered for the graph), they display different values from each other, which is confusing.
  • the UI icon for the filtered value changes to a wiggly line rather than the icon for the unit it represents.

I think that at a sensor level there could be a new option available:

record_time_throttle

which enforces that recordings are at least that many seconds apart. It could default to 0 (record everything), but sensor integrations that are prone to generating lots of data could set a different default.
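As a rough sketch, using the DSMR example from earlier in the thread (record_time_throttle is purely hypothetical here, it is not an existing option):

sensor:
  - platform: dsmr
    port: /dev/ttyUSB0
    dsmr_version: 5
    record_time_throttle: 60   # hypothetical: record at most one state per minute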

There is really quite a lot of data getting unnecessarily recorded, more examples:

  • continuous link quality of every zigbee device!!
  • rotate a wireless dimmer: HA writes 100s of state updates to the SD card.

I don’t think it is reasonable to expect users to edit configuration.yaml to create a separate filtered sensor for each dimmer, or to disable recording of chatty devices, just to keep HA responsive.

It ought to be possible for MQTT sensors to configure a record_time_throttle themselves during initial setup without user involvement.

If I made a PR for this would anyone be interested?


If you can convince the devs it’s a valid addition, then why not?

Some initial progress on this. It’s not a trivial problem to solve. It’s fairly straightforward to block additional writes to the database that occur within a configurable short time window, but I expect that if there are multiple readings in quick succession, the last one is the one we should keep.

The problem is that once a db write has been blocked, there is no new event on the bus to trigger any database write after the block period has ended. I don’t know the HA underlying framework very well, but I’m looking for a way to trigger a deferred state_changed or a custom event to be able to write the final value once a burst of values has finished.

For example, without this, a rapid turn of a dimmer would probably only record the first increment in the database and discard everything else.

Implementation suggestions welcome!

Maybe there should be some sort of cache layer between Home Assistant and its database, for example Redis (or another in-memory NoSQL DB). All sensor readings/state changes could be written to this layer instead of the DB, and there would be another process responsible for moving records from the cache to the actual DB (it could also do filtering, after-the-fact throttling, data modification, and so on). I am not a developer; it’s just an idea.

There would be no need to manipulate database writes, throttle sensor data, or use other tricks if the database structure were efficient.

I don’t fully agree with this. My storage is getting blasted by thousands of writes, to the point that I regularly clone my sd card ready to cycle in the next one when it fails.

I use RPi on purpose to save energy compared with the full computer I used to use for HA.

It’s not just about DB size or making queries efficient. It’s about reducing the number of entries by a huge factor.

Yes, DB storage efficiency would help too (a good idea - yes please), but it would make no difference on its own to the number of file system journal updates.

If a device is generating data every ms, but you only need to keep 1 per minute, you would reduce by a factor of 60000.

I think a cache layer in ram between the ha state updates and the DB might be an option. The ideal would be for HA to see everything as normal, queries of recent events would be looking at the cache, and only important stuff would make it to the filesystem.


I run my database in memory only (db_url: 'sqlite:///:memory:'). I don’t need weeks of data to make pretty graphs, just a few days to help troubleshoot if anything goes awry.
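For reference, the recorder section of configuration.yaml then looks something like this (purge_keep_days set to a few days, as described above):

recorder:
  db_url: "sqlite:///:memory:"
  purge_keep_days: 3   # keep only a few days of history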

A few days back I changed the recorder from one second to two seconds, and RPi 3B+ performance improved a lot; the DB size was slashed in half. I started by excluding everything, then started to include only the things I’d like to have history for.
That being said, it is a very manual process. I think the idea of this post has lots of merit.

I am not sure how HASS stores information: each cycle (1 sec, etc., as specified in the recorder), does it dump everything, or does it just record entities that have changed? If it is the first, a lot of computing power is wasted on writing data; even an SSD is slower than the CPU, which has a big impact on RPi systems. Changing it to option 2 would improve performance drastically.

It would be nice if:

  • each entity had a checkbox, turned off by default, to include it in the recorder, or
  • each entity had its own time cycle (a throttle filter won’t change anything if everything is written anyway); some data needs to be recorded every few seconds, other data every few minutes, and some, like temperature, would be fine if recorded once an hour.

The point is, there is a lot of room for improvement here.

That being said, we cannot complain and have to thank the developers for the amazing work they are doing and sharing with us.

Yes, this is what I initially thought, but analysing the code made me think it’s not that simple. DB writes are driven by events such as incoming data. If there’s no event/no state change, then nothing gets written (good). But if there are rapid repeated state changes, what do you do?

Blocking too many writes in a short time would lead to the true final state not being recorded correctly. E.g. opening and closing a door, then waiting an hour, would leave the recorded state as open for that hour. Probably some automations would go wrong.

The challenge is detecting when a sequence of changes has finished, then writing only the final state. But there is no event fired, e.g., one second after the last change.

Perhaps each rapid state change event could trigger a deferred write, which would be replaced by a new deferred write if ‘last changed’ was too recent. If the value was changing every second but we only needed it once per minute, we’d have to do something like a non-deferred DB write each minute to ensure we are not perpetually deferring everything.

But I still think HA assumes that the DB is a true representation of the state history.