(SOLVED) Performance of history data is extremely slow

OverloadUT · June 23, 2017, 9:50pm

UPDATE 8/12/2017: Version 0.51 contains optimizations that fix the history performance issues!

(Original post:)

I have been running Hass on a RPi 3 for quite some time now, and I have gotten to the point where my History page cannot open at all. I switched to MySQL a bit back to hopefully improve performance. It seemed to help, but still not enough. I’m certain that a big cause of this is that I have a few sensors that report every 10 seconds. Still though, a RPi 3 with a properly indexed database should be able to handle this just fine.

EDIT: My mistake - it turns out that I was looking at the events table for the below explanation, rather than the states table which does indeed have entity_id as an index. I am still poking around to see if there’s something that can be done to make these queries faster, but my below explanation is not the cause.

ORIGINAL POST:

I decided to peek in to the database and noticed something that seems quite disturbing: Every single entry in the history stores all of its data (including the critical entity_id) inside one big JSON blob. Now, I’m no DBA, but I am pretty sure this is massively slowing down any big queries, because no index can exist to grab entries for a single entity_id or any other such filters.

Does anyone have any background on why the data is stored this way, or (even better) a correction of my assumptions that this is a poor way to store it?

If this is the cause as I suspect, has anybody put any effort towards refactoring this system to store the data in columns? If not, I may take a crack at it myself.

VDRainer · June 23, 2017, 10:46pm

Hi @OverloadUT, this is a good question and i don`t know the answer.
I purge my data after 7 days and excluded sensors that spam the db, but already wondered about the way data gets stored.

moinmoin · June 25, 2017, 5:17pm

I experienced the same as @OverloadUT and wondered how data was stored in MySQL.I would also appreciate a well structured DB

andrey · June 25, 2017, 7:59pm

Look here:

github.com

home-assistant/core/blob/dev/homeassistant/components/recorder/models.py

"""Models for SQLAlchemy."""
import json
import logging

from sqlalchemy import (
    Boolean,
    Column,
    DateTime,
    ForeignKey,
    Index,
    Integer,
    String,
    Text,
    distinct,
)
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm.session import Session

from homeassistant.core import Context, Event, EventOrigin, State, split_entity_id
from homeassistant.helpers.json import JSONEncoder

This file has been truncated. show original

The DB is indexed by domain, entity_id, etc.
There is an attributes text json field with all the attributes.

OverloadUT · June 27, 2017, 2:00am

I have good news to share.

I have been digging deep in to the queries used by the History page and individual entity graph popups, and have found multiple places where indexes are not being used properly for these queries. By adding some specific multicolumn indexes I have managed to speed up some very common queries from 30 seconds to 3 seconds, and the query that times out on my History page has shrunk from over 900 seconds (before I killed it) to 15 seconds.

I will continue to dig in here and figure out what the specific best additions are and get a PR ready.

sunnythaper · June 27, 2017, 2:23am

Awesome news! Can’t wait to see the progress!

Mukund · June 27, 2017, 2:57am

Can you explain the indexes that you added? I have the same issue here.

RobDYI · June 27, 2017, 3:35am

I run this script every 30 mins to reduce entries a lot. I find if you don’t use time as a trigger it also helps. I think I read that a problem with the database is all entries are stored including ones where no state change has occurred such as in polling.

#!/bin/sh
mysql -uhass -phass <<QUERY_INPUT
USE HOMEBASE;
DELETE FROM events WHERE event_type = 'service_executed';
DELETE FROM events WHERE event_type = 'call_service';
DELETE FROM events WHERE event_type = 'component_loaded';
QUERY_INPUT

diplix · June 27, 2017, 3:18pm

since the History Statistics Sensor also relies heavily on queries of the history component, faster, better indexes and queries woiuld be a huge improvement for the History Statistics Sensor, too. right now my 6 statistics sensors need more than 4 minutes to query the on/off states of 6 entities, which also blocks HA startup for more than 4 minutes.

OverloadUT · June 27, 2017, 7:34pm

Sure thing. Here’s a summary of my research so far. After some more testing and validation of my fixes, I plan on making the necessary changes in a PR.

Scenario 1: accessing the history of a single entity (clicking on an entity on the frontend)

We have a couple problem queries. Here is one:

SELECT states.state_id AS states_state_id, states.domain AS states_domain, states.entity_id AS states_entity_id, states.state AS states_state, states.attributes AS states_attributes, states.event_id AS states_event_id, states.last_changed ASstates_last_changed, states.last_updated AS states_last_updated, states.created AS states_created
FROM states
WHERE (states.domain IN ('thermostat', 'climate') OR states.last_changed = states.last_updated)
  AND states.last_updated > '2017-06-25 22:02:07.718556'
  AND states.entity_id IN ('sensor.living_room_temp_diff')
  AND states.last_updated < '2017-06-26 22:02:07.718556'
ORDER BY states.entity_id, states.last_updated

We have two problems here. One is the states.domain IN ('thermostat', 'climate') OR states.last_changed = states.last_updated bit - it’s pointless because we’re already asking for a specific entity_id whose domain is always static, just causing the database to work harder.

Proposed Solution 1: Change the code to remove that domain filter when fetching history for a single entity.

The second problem is that we’re sorting by last_updated, which does not have a multicolumn index with entity_id. Creating this index:

CREATE INDEX ix_states_entity_id_last_updated ON states (entity_id, last_updated);

improves performance from 80 seconds to less than 2 - literally an order of magnitude!

Proposed Solution 2: Add this index to the db model

The second problem query in this scenario is a subquery used to grab the last state_id in a time range, which I will isolate here for clarity:

SELECT max(states.state_id) AS max_state_id
FROM states
WHERE states.created >= '2017-06-10 21:53:38.977455'
  AND states.created < '2017-06-25 22:30:55.106688'
  AND states.domain NOT IN ('zone', 'scene')
  AND states.entity_id IN ('sensor.aeotec_zw100_multisensor_6_temperature_2_1')
GROUP BY states.entity_id

First of all, I want to validate that this second query is necessary at all, especially for how expensive it is. I am suspicious that we should be doing this second query rather than simply grabbing the highest state_id from the first query in code.

Proposed Solution 3: I need to do more research on the necessity of this query in the first place.

Assuming that we will keep the query, we have a problem similar to the first query: the domain filter (states.domain NOT IN ('zone', 'scene')) is completely unnecessary, because we’re already filtering by a single entity_id. In fact, removing this condition improves the performance of this query from 30 seconds to less than half a second. 60x improvement!

Proposed Solution 4: Remove the domain filter from this query in the code, when fetching a single entity.
Proposed Solution 5: Remove the GROUP BY clause in this query, because we’re already filtering by a single entity_id.

Scenario 2: Opening the History page

This one is a much worse problem, mainly because it’s just the above issues multiplied by how many entities you have. On my RPi3, the query to open this page churns for over 900 seconds before I get impatient and give up. I have no idea how long, if ever, it will actually take. So I dug in here as well.

Here’s the first problem query I am looking at. It’s a subquery used on the History page to grab the last state_id for each entity.

SELECT max(states.state_id) AS max_state_id
FROM states
WHERE states.created >= '2017-06-10 21:53:38.977455'
  AND states.created < '2017-06-26 07:00:00'
  AND states.domain NOT IN ('zone', 'scene')
  AND states.domain NOT IN ('zone', 'scene')
  AND states.domain NOT IN ('automation', 'group')
GROUP BY states.entity_id

First up is the same concern with the second query: why do we need this? I’ll investigate and determine if there is a more efficient way to do this from the time series query rather than having a second query just to get the final state_id for each entity.

Proposed Solution 6: I need to do more research on the necessity of this query in the first place.

As for the query, right off the bat we can see some weird duplicate domain restrictions. The first two are seemingly global logic for the history page, and the last one is because of the specific domains I have excluded in my Hass config.

Proposed Solution 7: Remove erroneous domain filter

Next up, we’ve got an index issue. We have no multicolumn index appropriate for this query. I have been experimenting with different indexes to solve this one with varying success. MySQL is having issues picking the right index sometimes, so I am investigating. Here is the probably ideal index:

CREATE INDEX ix_states_created_domain_entity_id ON states (created, domain, entity_id);

When I can get MySQL to use this index, the query changes from so long that I lose patience to 10 seconds. That’s not as fast as I would like, but it’s still multiple orders of magnitude faster than before.

Proposed Solution 8: Figure out the ideal index for this query and add it to the db model

So, that’s my research so far. Each time I fix a query, I find the next worse query and investigate that one, so I don’t think this is all of the fixes that are needed. I will continue my research and experimentation and report back when I know even more!

rendili · June 27, 2017, 7:56pm

Excellent work @OverloadUT, really looking forward to the result

cgarwood · June 27, 2017, 7:57pm

There was some chatter in the devs chat today about other database optimization as well. Namely, the purpose/need of the events table. Since we have the states table that stores all state data every time there is a state_changed event, I don’t think we should be saving a JSON blob of all of the state data (including all of the previous state data!) inside the events table. That would cut database sizes in half (likely more) since the states table already contains all of this data.

trisk · June 27, 2017, 7:57pm

Hey, do you mind joining https://gitter.im/home-assistant/home-assistant/devs ? A few minutes before you posted this, we just had a discussion about database space and performance concerns (proposing deprecating the events table and making states into a time-series aggregation or some other schema change more suited for time span queries).

Mukund · June 27, 2017, 10:02pm

Wow, great work @OverloadUT. I’ll try and investigate more on this from my side as well.

cmsimike · July 3, 2017, 7:59pm

Hi everyone,

I believe my PR (#8255) in .48 should have fixed any performance issues you’ve seen with hass startup times or the history view with anyone with “significant” sized databases.

OverloadUT · July 3, 2017, 8:46pm

Hi cmsimike,

I saw your PR; thanks!

It addresses the single biggest bottleneck in startup and history views which is awesome. However, there are still a few significant areas that can be improved from my overview above. Specifically, your PR partially addresses Proposed Solution 8 (I am going to submit a third dimension to that index you added, which will keep all of your gains but add a few more for certain situations).

Right now I am working on fixing the way the domain filters get applied (solutions 1, 4, and 7) as well as figuring out if there’s a way to remove the entire second query from all of the history calls, which is slowing everything down (solutions 3 and 6)

cmsimike · July 3, 2017, 8:55pm

Excellent!

OverloadUT · August 12, 2017, 11:32pm

To anyone who was watching this thread for updates: All of my optimization work just went out in 0.51

fanaticDavid · August 13, 2017, 1:11am

Thank you so much for your awesome work, @OverloadUT !! And I’m sure I’m far from being the only person to be really happy with this release, just for the recorder improvements.

A little while ago I switched from MariaDB 5 to MariaDB 10 on my Synology NAS. Lately, loading history on any entity with frequent state changes had been taking a long time and I thought it was due to switching MariaDB versions. So I tried to tweak MariaDB’s variables to improve performance and, although that wasn’t completely unsuccesful, loading the history on some entities would still take several minutes. I even had to extend timeouts in my nginx config so the graphs had enough time to actually show.

Thanks to your hard work, loading the history on any entity now takes 10 seconds or less. For alot of the entitites, the graphs appear nearly instantly.
10 seconds still sounds like a long time, but in my case the difference is beyond significant. I suspect your work will show even more once I get my hands on a newer NAS with more RAM to run my MariaDB instance on.

OverloadUT · August 13, 2017, 2:58am

That’s awesome to hear!

If you want to check out exactly where your 10 seconds is coming from now, you can turn on DEBUG-level logging for both history and recorder:

        homeassistant.components.history: debug
        homeassistant.components.recorder: debug

This will give you debug entries on every query, like this:

2017-08-12 11:46:24 DEBUG (Thread-5) [homeassistant.components.recorder.util] converting 6 rows to native objects took 0.101798s
2017-08-12 11:46:24 DEBUG (Thread-5) [homeassistant.components.history] get_significant_states took 0.111902s
2017-08-12 11:46:24 DEBUG (Thread-5) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.015284s
2017-08-12 11:46:24 DEBUG (Thread-5) [homeassistant.components.history] getting 1 first datapoints took 0.047329s
2017-08-12 11:46:24 DEBUG (MainThread) [homeassistant.components.history] Extracted 7 states in 0.164105s

The key there is the “converting 6 rows to native objects” line - that is the biggest bottleneck right now, and it’s all native Python execution. That one won’t change no matter the db engine. I have some thoughts about how to increase performance there, but it’s far less critical