How large is your database (Database Optimization)?

Hi,

Is there any way to delay HA startup until the MariaDB addon starts (updated to 0.112 today, no issues before)?

Connecting to the MariaDB addon on the same machine results in logbook (and all related services) failing. However, with the same settings on a different Hassio machine everything works fine.

LE1: 0.112 failed to upgrade the database. After troubleshooting, it seems that having logbook_cache active at the time of the upgrade, combined with a large volume of records, completely botched the structure of the database, and I had to start from scratch with a new MariaDB addon instance.

LE2: for about 12 hours I used MariaDB on a different machine before starting from scratch. This is the impact on the machine (Intel NUC with Celeron J4005 and 8 GB of RAM); although it is quite early for a complete picture, it seems that 0.112 is a little more taxing on RAM (the logbook loads almost immediately, history takes 4-6 seconds, and the database grows by ~40 MB each hour):

0.111.4 (550 entities monitored; MariaDB addon on same machine with HA; logbook_cache used): CPU used: 18-20%; 15m load: 0.22-0.25; memory used: 20-22%

0.112 (all 2,000 entities monitored; MariaDB addon on different machine; no logbook_cache): CPU used: 8-11%; 15m load: 0.11-0.14; memory used: 12-14%

0.112 (all 2,000 entities monitored; MariaDB addon on same machine with HA; no logbook_cache): CPU used: 16-20%; 15m load: 0.18-0.22; memory used: 25-30%

LE3: the second instance above might not be fully comparable, as it covered a period of about 12 hours, most of it during the night, so significantly fewer events were recorded. Overall, the last few HA updates brought huge performance improvements: startup time decreased from 180-200 seconds (pre-0.110) to 16-20 seconds for the frontend to come up and ~50 seconds for all the components to load (it can probably decrease even further by cleaning up some of the components I hardly ever use).

Typically this would be handled with multiple tables.

For example, a tblAttributes table which contains columns for a primary key and the attribute name.

That table would contain all the possible entity attributes. It would be up to the database abstractor to handle additions of attributes if it is called to write or read one that doesn't already exist.

The second table, tblValues, has columns for a primary key, an attribute key, an event key and the actual values. Likely distill all values to either string (varchar?) or numeric (float, long int, whatever is best) and have a column for each value type. The abstractor determines which column to write/read based on the data type. The attribute key links to the matching row in tblAttributes.

Then the Events table links each row to the tblValues rows having the same event key, and tblValues links each row to its corresponding attribute key.
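
Very roughly, something like this (the table and column names are just made up to match the description above, not anything from the actual HA schema):

  -- Illustrative only: a lookup table for attribute names, an events table,
  -- and a values table that links the two.
  CREATE TABLE tblAttributes (
      attribute_id   INTEGER PRIMARY KEY,
      attribute_name VARCHAR(255) NOT NULL UNIQUE
  );

  CREATE TABLE Events (
      event_id   INTEGER PRIMARY KEY,
      event_type VARCHAR(64) NOT NULL,
      time_fired DATETIME NOT NULL
  );

  CREATE TABLE tblValues (
      value_id      INTEGER PRIMARY KEY,
      event_key     INTEGER NOT NULL REFERENCES Events (event_id),
      attribute_key INTEGER NOT NULL REFERENCES tblAttributes (attribute_id),
      value_string  VARCHAR(255),  -- used when the attribute value is text
      value_numeric DOUBLE         -- used when the attribute value is a number
  );

  CREATE INDEX idx_values_event     ON tblValues (event_key);
  CREATE INDEX idx_values_attribute ON tblValues (attribute_key);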

Reading is then just a matter of the correct JOINs. Writing takes a little more work, as one must first determine whether there is a matching attribute and get the associated primary keys for the needed attributes. But that's just another query… Or the attributes table could simply be kept as a collection in memory, since it doesn't change often, and re-loaded as needed. All of this would be abstracted in the database code, so no other changes would need to be made unless there are components or integrations performing direct database queries that rely on the old structure.

If I were at all competent in Python I'd help. :slight_smile:

Another good example of how to store large amounts of data efficiently is Zabbix:
https://zabbix.org/wiki/Docs/DB_schema/4.0

That would certainly make the database smaller. The downside is you now need a complex memory cache to make writes performant, queries get much more complex, and purging old data becomes extremely complex because you have to figure out which attributes and values are no longer referenced.

While that is all doable, it would significantly lower the number of potential maintainers, as it would steeply increase the learning curve needed to make changes without breaking all that complexity.

Comments interspersed…

Maybe. Usually a query is done first to obtain the needed primary keys. Indexed read queries have almost no overhead compared to writes, so the impact should be on the order of microseconds.
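
Using the made-up table names from my earlier post, a write would boil down to something like:

  -- Make sure the attribute exists (no-op if it already does), then fetch its key.
  -- MariaDB syntax; SQLite would use INSERT OR IGNORE instead.
  INSERT IGNORE INTO tblAttributes (attribute_name) VALUES ('brightness');
  SELECT attribute_id FROM tblAttributes WHERE attribute_name = 'brightness';

  -- Then insert the value row using the keys returned above
  -- (12345 and 42 are placeholder event/attribute keys).
  INSERT INTO tblValues (event_key, attribute_key, value_numeric)
  VALUES (12345, 42, 255);

And if the attributes are cached in memory, the first two statements only need to run when a brand-new attribute shows up.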

They should get much simpler, as most of the work is now done by the DB engine through a few joins vs. processing all that JSON in code. One query pulls an event with all its attributes and values attached, each value neatly in a typed column.

Kind of like:

  SELECT (stuff)
  FROM Events
  LEFT JOIN tblValues ON tblValues.event_key = Events.event_id
  LEFT JOIN tblAttributes ON tblAttributes.attribute_id = tblValues.attribute_key
  WHERE (conditions)

There may still be errors in there, but you get the idea. No more event_data column storing loads of JSON text which can't be optimized or compressed very well (if at all), no more processing JSON in code, etc.

[quote]
and purging old data becomes extremely complex because you have to figure out which attributes and values are no longer referenced.
[/quote]

It's the SELECT query in reverse, as a DELETE query.

Or in the case of the Attributes table, probably not even a need to purge.
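
e.g., once the old rows in Events are deleted, something like this (again with the made-up names from above) cleans up the value rows that no longer point at anything:

  -- Remove value rows whose parent event has been purged.
  DELETE FROM tblValues
  WHERE event_key NOT IN (SELECT event_id FROM Events);

Or, with ON DELETE CASCADE on the foreign key, the database engine does it automatically when the event rows go.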

A bit like moving from YAML to the .storage folder then? :grin: It's all standard SQL, which most developers should be familiar with if they deal with databases, and not too complex, as it is basically the setup of every relational database. I can't find the database code on GitHub or any info on how it is abstracted on the Developer page, but I'd assume that the database communication is abstracted so components, integrations, etc. don't have to manipulate it directly? In that case, the only developers who would need those skills are the ones rewriting the database abstraction layer for the new schema. Everything above it is unchanged.

I'd certainly be willing to help, as I've implemented many of these types of databases before, except for the major fact that I have zero Python skills. But the database really is a liability in Home Assistant. It is something that has to be actively managed by users, not allowed to grow, and treated gently, all because of a fundamental architectural problem which really has no solution other than a redesign. That redesign is going to have to happen sooner or later, especially as Home Assistant moves more mainstream with GUI configuration and all the usability improvements lately. Right now it is a ticking time bomb in every installation, especially those with SD cards. When passing 10 values to Home Assistant via MQTT every second is enough to bloat the database by 4+ GB every week (my personal experience with an energy monitor), something has to be done. In an otherwise amazing project, it's the one significant flaw.

And I hope that what I've said doesn't come across too harshly. I'm indebted to the excellent Home Assistant developers for such a versatile and reliable platform. And I do hope to contribute with documentation of my hardware designs and implementation once I've reached a point where they are worth sharing in detail.

The database models are here:

Everything is https://www.sqlalchemy.org/ based.

I should probably point out that event_data is now empty for state_changed events, which is why I've been focusing on the attributes issue.

I'm not sold that adding additional tables to track attributes and attribute values is worth the added complexity, since most of the overhead went away when we stopped storing the old and new state in the events table as JSON.

I did some testing with a copy of my existing database. Emptying the existing attributes column decreased the database size by 19%. I'm estimating the cost of adding the new attributes and attribute-values tables at ~5%, so my best estimate for reworking the attributes is a 14% savings.

I did the upgrade to 0.112 and my database converted without issue. A 4.3 GB MariaDB database in Docker took about 2-3 minutes to convert, running HA in Docker on a Synology DS1515+. However, the database size went up about 300 MB right after the conversion. It wasn't until I ran a manual purge with repack about 8 hours later that it finally went down and settled around 4.4 GB. My purge keep days is set to 28. I record roughly ~600 entities and some are pretty chatty. I'll keep monitoring over the next few days and weeks to see how the database settles down. History and logbook are lightning fast.

You should start to see the database go down in size as the old format is purged over time after a repack.
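
For anyone who wants to reclaim the space by hand at the database level instead of waiting, the usual commands are roughly these (run against your own recorder database, and for SQLite ideally with HA stopped):

  -- MariaDB / MySQL: rebuild the big recorder tables so free pages are released to disk.
  OPTIMIZE TABLE events, states;

  -- SQLite: rewrite home-assistant_v2.db without the free pages.
  VACUUM;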

What's the default # of days the DB stores? I never changed anything, and DB size is ~2.3 GB.

Btw, if you want a sensor to monitor it (for SQLite installs):

  # HA database size
  - platform: sql
    # Not needed when using recorder, because we already have a connection.
    # Absolute path begins after the 3rd "/" (location for Hassbian).
    #db_url: sqlite:////home/homeassistant/.homeassistant/home-assistant_v2.db
    scan_interval: 600
    queries:
      - name: Database
        # This is the fastest solution, available since sqlite 3.16.
        # It's the same info as returned by ".dbinfo" in the sqlite3 cmdline client.
        query: 'SELECT page_count * page_size as size FROM pragma_page_count(), pragma_page_size();'
        column: 'size'
        unit_of_measurement: 'Bytes'

Hint: You can always display this value in a template in a nice-to-read way:

{{ (state_attr('sensor.database','size') | filesizeformat(binary=false)) }}
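
And if you're on MariaDB rather than SQLite, a similar size check can be done with a query against information_schema (replace 'homeassistant' with whatever your database is actually named):

  -- Total data + index size of the database, in MB.
  SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS size_mb
  FROM information_schema.tables
  WHERE table_schema = 'homeassistant';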

10 days

Thanks for the heads-up!

I have to admit I'm still on .103, as I haven't upgraded due to the requirement to upgrade my Python version, so I wasn't aware there had been major changes to the database. Indeed, removing the event_data JSON strings is going to be the single best speed and size optimization if keeping the current data structure. That's definitely a much-needed change.

Note that moving to a relational model isn't necessarily about data size (though it eliminates large amounts of duplicate data) but about offloading processing from the client code to the database engine, where it can be done much more quickly.

I look forward to eventually upgrading to .112 and seeing the database upgrades.