Over the last week or so I’ve been looking into what is causing my database to grow so big. When I upgrade home assistant this last time it took almost 2 hours to start up because of some database operations is was doing on a 20+GB sqlite3 database. I know I don’t have 20GB of truly useful data in all that.
TL;DR Found the offenders and now only include entities that I want to see in either History or Logbook.
I’m sure someone will find these helpful so here are a couple queries that helped by find the culprits. Note, I’m using the default SQLite database on Hassbian. To speed things up and prevent corrupting my database I stopped home assistant, copied the homeassistant_v2.db over to my desktop, and began my investigation.
One of the first things I did was run sqlite3_analyzer and got a report on which tables were the biggest.
Events table
The events table accounts for 2/3 of the database size. Every sensor event the “event_data” field includes a full copy of the json for both the old and new states. This table doesn’t have an entity_id column so it makes it hard to track down which sensors to blame. The following query with a common table expression (CTE) can do that but beware if you have a multi-GB database it could take a really long time. The result of this query is each entity id, the total size of all event_data, the number of events for that entity_id, and the average size of event_data per event.
WITH step1 AS (
SELECT *, INSTR(event_data, '"entity_id": "') AS i
FROM events),
step2 AS (
SELECT *, CASE WHEN i>0 THEN SUBSTR(event_data, i+14) ELSE '' END AS sub
FROM step1),
step3 AS (
SELECT *, SUBSTR(sub, 0, INSTR(sub, '"')) AS entity_id
from step2)
SELECT entity_id, SUM(LENGTH(event_data)) size, COUNT(*) count, SUM(LENGTH(event_data))/COUNT(*) avg
FROM step3
GROUP BY entity_id
ORDER BY size DESC;
To investigate a single event by entity_id use this (make sure to keep that trailing double quote where it is):
SELECT * FROM events WHERE event_data LIKE '%sensor.home_power_main"%';
States table
The states table accounts for the other 1/3 of the database. Make essentially the same query there is a lot simpler and faster as it includes and is indexed on an entity_id column. Similarly the attributes is the big column here as it contains the json blob for the state.
SELECT entity_id, SUM(LENGTH(attributes)) size, COUNT(*) count, SUM(LENGTH(attributes))/COUNT(*) avg
FROM states
GROUP BY entity_id
ORDER BY size DESC;
To investigate a single entity_id in the states table use this:
SELECT * FROM states WHERE entity_id = 'group.thermostats';
Discoveries
Here are a few of my take aways after investigating and tweaking:
- The bigger the database the slower the startup and shutdown
- Each restart dumps a bunch of events into the database
- Disabling
history
andlogbook
or excluding entities from either has no effect on the database. It will keep storing all the data even if you’re not looking at it. - Most of my database is json blobs
- Most of the json blobs are duplicate data
- z-wave devices store a ton of information! A single Aeotec Smart Home Energy meter has 25 different sensors (25!!!) I only need 3 of those for viewing useful data in history (zwave spec here)
- The zwave option “Exclude this entity from Home Assistant” does not exclude it from recorder (not sure what it does…)
- If you don’t care to see it in History or Logbook don’t allow the recorder to persist it
Recorder inclusion and exclusion
Since 95% of my database entries were storing stuff I didn’t care about I began with the approach that I should include only what I want to see in History or Logbook. Home Assistant provides a way to do this in the recorder component configuration. I should also point out that automations don’t need anything from recorder or the database. They will keep working just fine with it disabled.
UPDATE: The include/exclude code has been re-written as of 0.57. I’ve made some comments on how it works in this later post. Basically you don’t have to do the weird trick to include only a few of a domain while specifying other include:domains.
The new include/exclude logic is found in the entityfilter
found here. Since it has been pulled out into a helper class it should make includes and excludes work the same in any configuration. The main improvement is that you can more easily do the selective include. To keep a trim I’d suggest taking the following approach:
- If you want all or most of a domain, put the domain in include:domains and the few things you don’t want in exclude:entities
- If you only want a few of a domain, put the few items in include:entities
- Everything else is excluded by default. This prevents new devices and components from filling your database with crap.
With that in mind here is an update to my config for home-assistant 0.57+:
recorder:
purge_interval: 2
purge_keep_days: 14
include:
domains:
- automation
- binary_sensor
- climate
- device_tracker
- input_boolean
- input_number
- media_player
- switch
entities:
- group.family
- group.guests
- sensor.centralite_3325s_0bbafb7a_1
- sensor.centralite_3325s_056b240f_1
- sensor.battery_phone
- sensor.disk_use_home
- sensor.energy_cost_yesterday
- sensor.home_energy_usage
- sensor.home_power_main
- sensor.home_power_c1
- sensor.home_power_c2
- sensor.energy_usage_today
- sensor.filesize_db
- sensor.main_garage_door_status
- sensor.pws_temp_f
- sensor.ram_use
- sensor.speedtest_download
- sensor.speedtest_upload
- sensor.speedtest_ping
exclude:
entities:
- switch.vision_zl7432_inwall_switch_dual_relay_switch_2
- switch.vision_zl7432_inwall_switch_dual_relay_switch_4
- switch.vision_zl7432_inwall_switch_dual_relay_switch_5
- switch.vision_zl7432_inwall_switch_dual_relay_switch_6
Details on my old hack...
I made several attempts to set the recorder includes and excludes working exactly they way I wanted and with as few devices needing to be specified as possible. It kept failing and then I read the source code and found out why. They way it works is not intuitive to me but I found a way to make it work. The false assumption I made is that you can use the include domains, include entities, and exclude entities. I found 3 scenarios for setting inclusions and exclusions, the 3rd being my case and a head scratcher:
- If you want to exclude everything from a domain, put it in exclude:domains
- If you want most of a domain but want to exclude a few things, put those few in exclude:entities
- If you only want a few of a domain,
put the entity in include:entities and domain in exclude:domain and include:domain. If you don’t add it to the include:domains it will get excluded here. If you don’t add the domain to the exclude:domains then everything else from the domain will also get through..
You’ll notice that there are domains (like sensor) doubled up in exclude and include because of that 3rd point, but it works perfectly.
recorder:
purge_interval: 2
purge_keep_days: 14
exclude:
domains:
- camera
- cover
- group
- sensor
- sun
- updater
- zone
- zwave
entities:
- switch.vision_zl7432_inwall_switch_dual_relay_switch_2
- switch.vision_zl7432_inwall_switch_dual_relay_switch_4
- switch.vision_zl7432_inwall_switch_dual_relay_switch_5
- switch.vision_zl7432_inwall_switch_dual_relay_switch_6
include:
domains:
- automation
- binary_sensor
- climate
- device_tracker
- group
- input_boolean
- input_number
- media_player
- sensor
- switch
entities:
- group.family
- group.guests
- sensor.centralite_3325s_0bbafb7a_1
- sensor.centralite_3325s_056b240f_1
- sensor.battery_phone
- sensor.disk_use_home
- sensor.energy_cost_yesterday
- sensor.home_energy_usage
- sensor.home_power_main
- sensor.home_power_c1
- sensor.home_power_c2
- sensor.energy_usage_today
- sensor.filesize_db
- sensor.main_garage_door_status
- sensor.pws_temp_f
- sensor.ram_use
- sensor.speedtest_download
- sensor.speedtest_upload
- sensor.speedtest_ping
Sorry that was so long winded. I hope it helps people chase down their log hogs!