Large homeassistant database files

I use a command line sensor like the one ntalekt posted. Unfortunately, the way Docker runs on my Synology NAS, I had to run the command as root on the NAS and send the output to a file in the shared config folder, then read that file within HA to grab the value.

As for the database size, mine grows about 20-30MB per day. It's currently sitting at about 550MB, and I just restarted HA to update (though it was from Docker) and it booted right up. I've never had a real issue with the database or restarting HA since switching to MySQL/MariaDB.


That’s a good tip. I might do something similar! Thanks

No problem Phil. Thanks for your glorious blog posts, they’ve helped me a lot.

Thanks for sharing, @ntalekt ! I used your code, and it’s up and running. For what it’s worth, I had to do the following extra steps to get it working:

  • install mysql-client (as I’m not using a MySQL server on my Raspberry Pi 3) with the following command: sudo apt-get install mysql-client
  • add the options -h 192.168.xxx.x -P 3307 to the command as my database is located on my Synology NAS
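ntalekt's original sensor isn't quoted in this thread, but for context, a command_line sensor along these lines can report the MySQL database size by summing table sizes from information_schema. This is a hypothetical reconstruction, not his code; the database name, user, and password are placeholders, and the -h/-P values come from the post above:

```yaml
sensor:
  - platform: command_line
    name: mysql_db_size
    # -ss suppresses the column header and table formatting so only the number is returned
    command: >
      mysql -h 192.168.xxx.x -P 3307 -u USER -pPASSWORD -ss -e
      "SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024, 1)
       FROM information_schema.tables WHERE table_schema = 'homeassistant';"
    unit_of_measurement: 'MB'
```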

Ha ha, great minds think alike! Just yesterday I made a database size sensor so I could monitor the growth of the file. I’m using Hassbian with the default SQLite database so my sensor is just a filesize sensor.

For those wanting to do that for the basic SQLite db file (or any other file for that matter), here is the setup:

  1. Add a command_line sensor to your config.
    This uses the stat command with the -c %s format option to retrieve the total file size in bytes. I divide by 1024 twice to get from bytes to MB. You can tweak the template to your liking. I also bumped the scan_interval up from the default of 60s to 900s (15 minutes), as I've been trying to reduce how much data is going into my database.
  - platform: command_line
    name: 'filesize_db'
    command: 'stat -c %s "/home/homeassistant/.homeassistant/home-assistant_v2.db"'
    unit_of_measurement: 'MB'
    value_template: '{{ (value|int / 1024 / 1024)|round(1) }}'
    scan_interval: 900
  2. I like to add an entry to customize too for name and icon (remember that's in customize under homeassistant)
- sensor.filesize_db:
  friendly_name: SQLite DB size
  icon: mdi:database
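If you want to double-check the template arithmetic outside Home Assistant, here is a small sketch that mirrors the value_template (the byte count is an assumed example, not a real reading):

```python
# Mirror of the sensor's value_template: bytes -> MB, rounded to one decimal.
# 576716800 is an assumed example of what `stat -c %s` might print for the db file.
size_bytes = 576716800
size_mb = round(size_bytes / 1024 / 1024, 1)
print(size_mb)  # 550.0
```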


Hope that helps others using the default database setup.


Over the last week or so I've been looking into what is causing my database to grow so big. When I upgraded Home Assistant this last time, it took almost 2 hours to start up because of some database operations it was doing on a 20+GB sqlite3 database. I know I don't have 20GB of truly useful data in all that.

TL;DR: I found the offenders and now only include entities that I want to see in either History or Logbook.

I'm sure someone will find these helpful, so here are a couple of queries that helped me find the culprits. Note, I'm using the default SQLite database on Hassbian. To speed things up and prevent corrupting my database, I stopped Home Assistant, copied the home-assistant_v2.db over to my desktop, and began my investigation.

One of the first things I did was run sqlite3_analyzer and got a report on which tables were the biggest.

Events table

The events table accounts for 2/3 of the database size. For every sensor event, the event_data field includes a full copy of the JSON for both the old and new states. This table doesn't have an entity_id column, which makes it hard to track down which sensors to blame. The following query with a common table expression (CTE) can do that, but beware: if you have a multi-GB database it could take a really long time. The result of this query is each entity_id, the total size of all event_data, the number of events for that entity_id, and the average size of event_data per event.

WITH step1 AS (
    SELECT *, INSTR(event_data, '"entity_id": "') AS i
    FROM events),
step2 AS (
    SELECT *, CASE WHEN i>0 THEN SUBSTR(event_data, i+14) ELSE '' END AS sub
    FROM step1),
step3 AS (
    SELECT *, SUBSTR(sub, 0, INSTR(sub, '"')) AS entity_id
    from step2)
SELECT entity_id, SUM(LENGTH(event_data)) size, COUNT(*) count, SUM(LENGTH(event_data))/COUNT(*) avg
FROM step3
GROUP BY entity_id
ORDER BY size DESC;

To investigate a single event by entity_id use this (make sure to keep that trailing double quote where it is):

SELECT * FROM events WHERE event_data LIKE '%sensor.home_power_main"%';
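Not from the thread, but if you'd rather drive the analysis from a script, here is a sketch that runs the same CTE through Python's sqlite3 module. It populates a throwaway in-memory events table with fake rows to illustrate; point connect() at your offline copy of home-assistant_v2.db instead to analyze a real database:

```python
import sqlite3

# Per-entity event-size breakdown, same CTE as the SQL above.
QUERY = """
WITH step1 AS (
    SELECT *, INSTR(event_data, '"entity_id": "') AS i FROM events),
step2 AS (
    SELECT *, CASE WHEN i > 0 THEN SUBSTR(event_data, i + 14) ELSE '' END AS sub
    FROM step1),
step3 AS (
    SELECT *, SUBSTR(sub, 0, INSTR(sub, '"')) AS entity_id FROM step2)
SELECT entity_id, SUM(LENGTH(event_data)) AS size, COUNT(*) AS count,
       SUM(LENGTH(event_data)) / COUNT(*) AS avg
FROM step3
GROUP BY entity_id
ORDER BY size DESC
"""

conn = sqlite3.connect(":memory:")  # swap for your copy of home-assistant_v2.db
conn.execute("CREATE TABLE events (event_data TEXT)")
conn.executemany(
    "INSERT INTO events (event_data) VALUES (?)",
    [
        ('{"entity_id": "sensor.home_power_main", "new_state": {"state": "1234"}}',),
        ('{"entity_id": "sensor.home_power_main", "new_state": {"state": "1250"}}',),
        ('{"entity_id": "light.kitchen", "new_state": {"state": "on"}}',),
    ],
)
rows = list(conn.execute(QUERY))
for entity_id, size, count, avg in rows:
    print(f"{entity_id:40s} size={size} count={count} avg={avg}")
```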

States table

The states table accounts for the other 1/3 of the database. Making essentially the same query there is a lot simpler and faster, as the table includes (and is indexed on) an entity_id column. Similarly, attributes is the big column here, as it contains the JSON blob for the state.

SELECT entity_id, SUM(LENGTH(attributes)) size, COUNT(*) count, SUM(LENGTH(attributes))/COUNT(*) avg
FROM states
GROUP BY entity_id
ORDER BY size DESC;

To investigate a single entity_id in the states table use this:

SELECT * FROM states WHERE entity_id = 'group.thermostats';

Discoveries

Here are a few of my takeaways after investigating and tweaking:

  • The bigger the database the slower the startup and shutdown
  • Each restart dumps a bunch of events into the database
  • Disabling history and logbook or excluding entities from either has no effect on the database. It will keep storing all the data even if you’re not looking at it.
  • Most of my database is json blobs
  • Most of the json blobs are duplicate data
  • z-wave devices store a ton of information! A single Aeotec Smart Home Energy meter has 25 different sensors (25!!!). I only need 3 of those for viewing useful data in history (zwave spec here)
  • The zwave option “Exclude this entity from Home Assistant” does not exclude it from recorder (not sure what it does…)
  • If you don’t care to see it in History or Logbook don’t allow the recorder to persist it

Recorder inclusion and exclusion

Since 95% of my database entries were storing stuff I didn’t care about I began with the approach that I should include only what I want to see in History or Logbook. Home Assistant provides a way to do this in the recorder component configuration. I should also point out that automations don’t need anything from recorder or the database. They will keep working just fine with it disabled.

UPDATE: The include/exclude code has been re-written as of 0.57. I’ve made some comments on how it works in this later post. Basically you don’t have to do the weird trick to include only a few of a domain while specifying other include:domains.

The new include/exclude logic lives in the entityfilter helper found here. Since it has been pulled out into a helper class, includes and excludes should work the same in any configuration. The main improvement is that you can more easily do the selective include. To keep a trim database, I'd suggest taking the following approach:

  1. If you want all or most of a domain, put the domain in include:domains and the few things you don’t want in exclude:entities
  2. If you only want a few of a domain, put the few items in include:entities
  3. Everything else is excluded by default. This prevents new devices and components from filling your database with crap.

With that in mind here is an update to my config for home-assistant 0.57+:

recorder:
  purge_interval: 2
  purge_keep_days: 14
  include:
    domains:
      - automation
      - binary_sensor
      - climate
      - device_tracker
      - input_boolean
      - input_number
      - media_player
      - switch
    entities:
      - group.family
      - group.guests
      - sensor.centralite_3325s_0bbafb7a_1
      - sensor.centralite_3325s_056b240f_1
      - sensor.battery_phone
      - sensor.disk_use_home
      - sensor.energy_cost_yesterday
      - sensor.home_energy_usage
      - sensor.home_power_main
      - sensor.home_power_c1
      - sensor.home_power_c2
      - sensor.energy_usage_today
      - sensor.filesize_db
      - sensor.main_garage_door_status
      - sensor.pws_temp_f
      - sensor.ram_use
      - sensor.speedtest_download
      - sensor.speedtest_upload
      - sensor.speedtest_ping
  exclude:
    entities:
      - switch.vision_zl7432_inwall_switch_dual_relay_switch_2
      - switch.vision_zl7432_inwall_switch_dual_relay_switch_4
      - switch.vision_zl7432_inwall_switch_dual_relay_switch_5
      - switch.vision_zl7432_inwall_switch_dual_relay_switch_6
Details on my old hack...

I made several attempts to get the recorder includes and excludes working exactly the way I wanted, with as few devices needing to be specified as possible. It kept failing, and then I read the source code and found out why. The way it works is not intuitive to me, but I found a way to make it work. The false assumption I made is that you can simply combine include:domains, include:entities, and exclude:entities. I found 3 scenarios for setting inclusions and exclusions, the 3rd being my case and a head scratcher:

  1. If you want to exclude everything from a domain, put it in exclude:domains
  2. If you want most of a domain but want to exclude a few things, put those few in exclude:entities
  3. If you only want a few of a domain, put those entities in include:entities and the domain in both exclude:domains and include:domains. If you don't add the domain to include:domains, the entities will get excluded; if you don't add it to exclude:domains, everything else from the domain will also get through.

You’ll notice that there are domains (like sensor) doubled up in exclude and include because of that 3rd point, but it works perfectly.

recorder:
  purge_interval: 2
  purge_keep_days: 14
  exclude:
    domains:
      - camera
      - cover
      - group
      - sensor
      - sun
      - updater
      - zone
      - zwave
    entities:
      - switch.vision_zl7432_inwall_switch_dual_relay_switch_2
      - switch.vision_zl7432_inwall_switch_dual_relay_switch_4
      - switch.vision_zl7432_inwall_switch_dual_relay_switch_5
      - switch.vision_zl7432_inwall_switch_dual_relay_switch_6
  include:
    domains:
      - automation
      - binary_sensor
      - climate
      - device_tracker
      - group
      - input_boolean
      - input_number
      - media_player
      - sensor
      - switch
    entities:
      - group.family
      - group.guests
      - sensor.centralite_3325s_0bbafb7a_1
      - sensor.centralite_3325s_056b240f_1
      - sensor.battery_phone
      - sensor.disk_use_home
      - sensor.energy_cost_yesterday
      - sensor.home_energy_usage
      - sensor.home_power_main
      - sensor.home_power_c1
      - sensor.home_power_c2
      - sensor.energy_usage_today
      - sensor.filesize_db
      - sensor.main_garage_door_status
      - sensor.pws_temp_f
      - sensor.ram_use
      - sensor.speedtest_download
      - sensor.speedtest_upload
      - sensor.speedtest_ping

Sorry that was so long-winded. I hope it helps people chase down their log hogs!


Hi, I am trying to install a MySQL DB on a fresh, clean Hassbian image (2017-10-11-Hassbian), but I fail right at the first step: "sudo apt-get install libmysqlclient-dev" fails with the error "Package 'libmysqlclient-dev' has no installation candidate".
When I continue anyway, after installing mysql-server I am not prompted to set a root password, but when I try to log in I am asked for one, and leaving it blank fails with an authentication error. So I guess there is a MySQL password already set in the Hassbian image, but I can't find any info on what it might be by default :frowning:
Help appreciated…

Is there any speed advantage to setting up mysql on the same Pi as HASS is running on? Or does mysql need to be on a different computer to speed up the historical data on HASS?

Regardless of where MySQL is installed, you will see a vast improvement in speed when querying the history. I don't think it matters where it is installed, as the connection method would still be the same.

So it's just that MySQL is much faster than the default SQLite?

They did some speed improvements to the recorder and database a few versions ago (0.51) but there is still a slight speed improvement using MySQL over SQLite.

Also, it comes down to personal preference. I already had a MySQL (MariaDB actually) database set up for managing my media library through Kodi, so it was pretty easy to set up the database for HA. I also don't like files growing wild in a folder like the SQLite database and the log file, so I make every effort to either relocate them and/or manage them better by configuring the recorder:, history:, logbook:, and logger: components.


Ok, I set up MySQL on a separate debian server I had. I made a database called “hassdb” and a user “hass” and gave full control to that user.

When I point home assistant to it with this configuration.yaml:

recorder:
  db_url: mysql://hass:[email protected]:3306/hassdb

I get errors that HASS can’t communicate with MySQL. I confirmed that the server is listening on 3306 with netstat -tln.

Any ideas?

Thanks!

Make sure the firewall is disabled: service firewalld stop or that you have allowed MySQL access through the firewall on port 3306.

Figured it out, had to bind to 0.0.0.0 in my.cnf
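For anyone else hitting this: the bind-address setting goes under the [mysqld] section of the MySQL/MariaDB config. The exact file path varies by distribution (on Debian-based systems it is commonly /etc/mysql/my.cnf or a file under /etc/mysql/mysql.conf.d/, but treat the path as an assumption):

```ini
[mysqld]
# Listen on all interfaces, not just 127.0.0.1, so HA on another host can connect
bind-address = 0.0.0.0
```

Restart the MySQL service afterwards for the change to take effect.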

Is there a good program to quickly make plots of the data in the sql database, kind of like HASS does but for longer periods of time?

Try the Influxdb and Grafana combination.

Interesting observations there regards exclude/includes - thanks for that!

However I’m still struggling with the mass of data in my database. I’m using MySQL and am seeing around 1300 records per minute going into the events table! The mysqld service is under high load, hass takes forever to show graphs and if I don’t periodically purge my database manually then mysql fails to allow anything to happen to the events table - it fails to lock records when updating. I then have to manually delete records in small batches until things are ok again.
My setup isn’t that big - 15 light bulbs, 10 humidity/temperature sensors, a few switches, etc.

My recorder config is like this:

exclude:
  domains:
    - mqtt
    - script
    - sun
    - updater
include:
  domains:
    - sensor
    - binary_sensor
    - switch
    - light
    - input_select
    - input_boolean
However I still see rows like this:
{"domain": "mqtt", "service": "publish", "service_data": {"topic": "tinkerforge/bricklet/lcd_20x4/ofQ/write_line/set", "payload_template": "< lots of code>"}}

This call runs quite frequently to update a screen but I don’t need it logged.
Why do I still see mqtt service call records when they are set to be excluded?

And these everywhere:
{"service_call_id": "85675568-79"}

How do I stop these filling up the log?

{"domain": "script", "service": "refresh_lcdscreen", "service_data": {}, "service_call_id": "85675568-76"}
This too - why is the script domain appearing in my logs with my config?

I just discovered the include/exclude logic has been re-written as of 0.57 and it alters previous behavior in certain circumstances. I’ve extracted the logic to help others understand what it now does. I used the code for 0.57.3. I’m not sure how to get this into the official documentation but it would be helpful to have a page with examples to help people get the right config. The good thing is that this new revision is more intuitive and would have saved me a lot of time in the past…

When the recorder is initialized it builds one of 6 filter functions based on what you have in your config:

  • Case 1 - No includes or excludes configured
    Everything passes/written to recorder

  • Case 2 - Only includes configured
    Nothing is recorded unless the domain or entity are specifically included

  • Case 3 - Only excludes configured
    Everything is recorded unless the domain or entity are excluded

  • Case 4 - At least one include domain and one exclude (domain or entity) configured
    Only records if:

    • Domain included and entity not excluded
    • or, Entity in include:entities

    Everything in the exclude:domains is ignored.

  • Case 5 - No include:domains, at least one include:entity, at least one exclude:domain configured
    Everything is recorded unless:

    • Domain in exclude:domains and entity not in include:entities
    • or, Entity in exclude:entities
  • Case 6 - No include:domains, no exclude:domains, at least one in include:entities, and at least one in exclude:entities
    Nothing is recorded unless the entity is in include:entities. Ignores all exclude:entities. This is pretty much a misconfiguration of case 2 and the exclude:entities should be removed.
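To make the six cases concrete, here is a rough Python approximation of the decision logic. This is my paraphrase of the cases above, not the actual entityfilter source:

```python
def entity_allowed(entity_id, include_domains=(), include_entities=(),
                   exclude_domains=(), exclude_entities=()):
    """Rough approximation of the 0.57 recorder entityfilter cases above."""
    domain = entity_id.split(".", 1)[0]
    have_include = bool(include_domains) or bool(include_entities)
    have_exclude = bool(exclude_domains) or bool(exclude_entities)

    if not have_include and not have_exclude:   # Case 1: record everything
        return True
    if have_include and not have_exclude:       # Case 2: include-only
        return domain in include_domains or entity_id in include_entities
    if have_exclude and not have_include:       # Case 3: exclude-only
        return domain not in exclude_domains and entity_id not in exclude_entities
    if include_domains:                         # Case 4: include:domains + any exclude
        return (domain in include_domains and entity_id not in exclude_entities) \
            or entity_id in include_entities
    if exclude_domains:                         # Case 5: exclude:domains + include:entities
        if entity_id in exclude_entities:
            return False
        return domain not in exclude_domains or entity_id in include_entities
    # Case 6: only include:entities and exclude:entities; excludes are ignored
    return entity_id in include_entities
```

For example, a config with include:domains: [sensor] and exclude:entities: [sensor.noisy] lands in Case 4: sensor.noisy is dropped, every other sensor is recorded, and everything outside the sensor domain is dropped.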

I feel like the change was an improvement over the previous implementation and all its quirks. However this means the post I made a few weeks ago isn’t correct for HASS 0.57+.

@Bottesford - Assuming you provided your entire recorder config and you updated to 0.57 then your configuration falls under Case 4 (or Case 4a as called in code). This means your exclude:domains config is ignored but it should still have the same effect. To help further I’d have to know what version you are running. Also make sure you have verified your config and restarted HASS (recorder includes/excludes cannot be refreshed while running). Also make sure the entries you’re finding are new ones since the database won’t clean out those old entries once excluded (until a purge).


Running Home Assistant 0.59, I realized why the db file is so big. I run the mysensors component to read and send off energy meter readings to HA. I just browsed the SQLite database, and each row in the states table looks like this:

773194 | sensor | sensor.energy_meter_0_2_2 | 404.685 | {"V_KWH": "349.661", "V_VA": "404.685", "device": "mqtt", "node_id": 0, "unit_of_measurement": "VA", "V_VAR": "-219.044", "description": "", "V_POWER_FACTOR": "0.845", "battery_level": 0, "V_WATT": "340.271", "friendly_name": "Energy Meter 0 2", "child_id": 2} | 2017-12-05 20:03:12.046012 | 2017-12-05 20:03:12.046012 | 2017-12-05 20:03:12.586282

Why on earth is the attributes field filled every time with the same gibberish data from the other sensors within the same child_id?
And why is entity_id not replaced with an index?

Hi,

Please allow me this related question: I've made a little script to try to keep the files under control and purge on demand.
How do I know something is happening, and/or when it is finished?

  purge_recorder:
    alias: Purge Recorder
    sequence:
      - service: recorder.purge
        data:
          keep_days: 5

thanks,
Marius