Statistics - Fix Issue - large amount of data

Abstract

If the Home Assistant WebGUI reports thousands of statistics issues, this how-to shows how to bulk-remove the orphaned entries directly in the SQLite database instead of fixing them one by one.

Introduction

If you face a lot of statistics issues to fix (in my case about 7000), and you don’t want to bother clicking the Fix Issue button thousands of times and then clicking the Remove button in the confirmation dialog each time, this little how-to might be useful for you.

Unfortunately, the WebGUI doesn’t have a “Fix All” button, so we need to run the cleanup manually, directly in the database.

Overall cleanup process

Disclaimer
As we will interact directly with the backend database, before executing any modification (UPDATE and/or DELETE queries), make sure the selection contains only the rows you want to modify for your specific case; otherwise you risk permanent loss of data! And don’t forget to back up your Home Assistant so you have a rollback opportunity.
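
For example, in addition to a full Home Assistant backup, a simple file-level copy of the database over SSH gives you a quick rollback path (ideally with Home Assistant stopped, as copying a live SQLite file can capture an inconsistent state; the .bak name is just an example):

cp /config/home-assistant_v2.db /config/home-assistant_v2.db.bak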

The cleanup process is the following:

  1. Identification of rows to clean
  2. Install and connect to sqlite3
  3. Enforce the DELETE CASCADE
  4. Run the DELETE query

1. Identification of rows to clean

In my specific case, these issues were mainly related to the Remote Home Assistant component, which kept adding all of its own sensors back to itself in a loop… the result: a lot of orphaned entries in statistics_meta and in the statistics data.

For testing the row selection, I used the SQLite Web add-on, which is more user-friendly for building a test query than the sqlite3 command line.

1.1 Limitation of SQLite Web add-on

Using this add-on, I figured out that there are some limitations:

  • The number of rows displayed is limited to 1’000;
  • Features that rely on a persistent connection, like PRAGMA, cannot be used.

Unfortunately, the implementation of the SQLite Web add-on doesn’t allow the use of PRAGMA, because each request is sent to the DB over a new connection. Since PRAGMA settings are connection-scoped, they never remain active between queries.

This can be verified in the SQLite Web add-on by simply executing:

PRAGMA foreign_keys=ON; 

then verify whether it is activated with:

PRAGMA foreign_keys; 
  • returned value 0 means foreign keys are not enforced
  • returned value 1 means foreign keys are enforced

With the SQLite Web add-on, the returned value is always 0.

1.2 Table constraints

Looking at the home-assistant_v2.db database, fortunately there are FOREIGN KEY integrity constraints with ON DELETE CASCADE on the statistics and statistics_short_term tables, referencing statistics_meta, which will simplify the cleanup task.

This means we can easily delete orphaned metadata_id rows in statistics and statistics_short_term by using the sqlite3 foreign_keys PRAGMA and the integrity constraints linked to statistics_meta.
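
You can check these constraints yourself; PRAGMA foreign_key_list reports, among other columns, the referenced table and the on_delete action (this read-only PRAGMA also works in SQLite Web, as it does not depend on connection state):

PRAGMA foreign_key_list('statistics');
PRAGMA foreign_key_list('statistics_short_term');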

1.3 Build the SELECT query

For me it was easy to identify the orphaned data, as most of it was prefixed with c16_, so I could execute a simple SELECT statement to find it:

SELECT * FROM statistics_meta WHERE statistic_id LIKE '%c16_%';

However, to have more control over which data needs to be cleaned, and also to catch other potential orphan issues, I built a more advanced query (with a JOIN) to verify the incriminated records and identify which rows exist in the statistics_meta table but not in states.

Below you will find the SELECT query to find orphaned rows in statistics_meta:

SELECT 
	CASE WHEN entity_id IS NULL
		THEN 'Orphaned'
		ELSE 'OK'
	END Issue,
	entity_id, 
	statistic_id 
FROM	(
		SELECT 
			b.entity_id, 
			a.statistic_id				
		FROM 
			statistics_meta a
		LEFT JOIN states b 
			ON a.statistic_id = b.entity_id
		GROUP BY 
			a.statistic_id -- we want to avoid duplicates
		);

Now that we have verified the data we want to delete, we need to modify the query so that it returns only the IDs of those rows:

SELECT 
	id
FROM	(
		SELECT 
			a.id,
			b.entity_id 		
		FROM 
			statistics_meta a
		LEFT JOIN states b ON a.statistic_id = b.entity_id
		GROUP BY 
			a.statistic_id
		)
WHERE 
	entity_id IS NULL;  /* NULL means the statistic_id is orphaned */

We can check the exact number of rows to be cleaned by wrapping the query in a SELECT count(*) like this:

SELECT count(*)
FROM (
  -- add here the query whose rows you want to count
);
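
For example, counting the orphaned rows found by the ID query above:

SELECT count(*)
FROM	(
		SELECT 
			a.id,
			b.entity_id 
		FROM 
			statistics_meta a
		LEFT JOIN states b ON a.statistic_id = b.entity_id
		GROUP BY 
			a.statistic_id
		)
WHERE 
	entity_id IS NULL;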

2. Install and connect to sqlite3

To run the cleanup query, given the SQLite Web add-on’s PRAGMA limitation, we need to connect to Home Assistant over SSH.

sqlite3 is not installed by default in the official supervised Home Assistant, so we need to execute the following command to install it:

apk add sqlite

Once installed, we can connect to the DB by running the following command:

sqlite3 /config/home-assistant_v2.db
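
Once at the sqlite> prompt, it is worth running a quick sanity check before changing anything (the tables used in this how-to should appear in the list):

.tables
SELECT count(*) FROM statistics_meta;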

3. Enforce the DELETE CASCADE

The first thing to do before running a DELETE query on statistics_meta is to enable the foreign_keys PRAGMA in sqlite3:

PRAGMA foreign_keys=ON;

Then verify the PRAGMA is activated (the result must be 1):

PRAGMA foreign_keys;

4. Run the DELETE query

We are now connected to sqlite3 with the foreign_keys PRAGMA active to enforce the DELETE CASCADE.

Now take the SELECT query built above and include it within a DELETE FROM like the following:

DELETE FROM statistics_meta
WHERE id IN (
	SELECT 
		id
	FROM	(
			SELECT 
				a.id,
				b.entity_id, 
				a.statistic_id				
			FROM 
				statistics_meta a
			LEFT JOIN states b 
				ON a.statistic_id = b.entity_id
			GROUP BY 
				a.statistic_id
			)
	WHERE 
		entity_id IS NULL  /* NULL means the statistic_id is orphaned */
	);

Finally, run the DELETE query above in sqlite3 to clean up the orphaned rows; thanks to the ON DELETE CASCADE constraints, the matching rows in statistics and statistics_short_term are removed automatically.
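
To verify that the cascade worked, you can check that no statistics rows point to a missing metadata entry anymore (both queries should return 0):

SELECT count(*) FROM statistics
WHERE metadata_id NOT IN (SELECT id FROM statistics_meta);

SELECT count(*) FROM statistics_short_term
WHERE metadata_id NOT IN (SELECT id FROM statistics_meta);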

Hey man thanks for this! Why did you delete your feature request? :slight_smile:

I posted it on GitHub instead of the community :slight_smile:

And also opened an issue for the SQLite Web add-on

Really great how-to @vinzcenzo ! :+1: :+1: :+1:

An hour ago I deactivated many entities I don’t need anymore (and never did, in most cases) - disabled, not deleted, because I want to know that the integration provides them. While they are for sure removed from the states table automatically over time (my db retention time according to the recorder setting: 14 days), thousands of entries remain in the statistics, statistics_short_term, and statistics_meta tables.

A few notes on your original post, including some questions (aiming at creating a “regular database cleanup how-to” to run every few months):

  1. Optimization: Adding
                ORDER BY Issue DESC;

to the end of

SELECT 
	CASE WHEN entity_id IS NULL
		THEN 'Orphaned'
		ELSE 'OK'
	END Issue,
	entity_id, 
	statistic_id 
FROM	(
		SELECT 
			b.entity_id, 
			a.statistic_id				
		FROM 
			statistics_meta a
		LEFT JOIN states b 
			ON a.statistic_id = b.entity_id
		GROUP BY 
			a.statistic_id -- we want to avoid duplicates
		);

will make things a bit easier (place the ORDER BY clause just before the query’s final semicolon).

  2. Information: Your issue has been automatically closed due to lack of response; please consider reopening it (if you can as its creator, otherwise hopefully someone who can will do so, or you can open a new one by copy-pasting the existing one):
  3. Optimization: In terms of the apk add sqlite installation step: please note, when running HA OS, sqlite already exists.

  4. Question: A very important question concerning the whole of steps 3 and 4 (maybe worth mentioning right before those steps and additionally in the Disclaimer section at the top?):
    Is it safe to run commands using sqlite3 on the CLI while the database (/config/home-assistant_v2.db) is still in use (HA is running)?

Stopping HA first would be safer I think, but is it necessary?
Asking because some time ago I had DBeaver CE connected to the HA database from another machine - which corrupted the db quite quickly, causing a lot of trouble of course.

  5. Question: This gives the (90) entities with the most database entries:
    SELECT entity_id, COUNT(*) as count FROM states GROUP BY entity_id ORDER BY count DESC LIMIT 90;
    Can you make a SELECT statement (I’m not into all that JOINing, my SQL is pretty rusty) which gives the (90) entities with the most database entries AND having statistics data? (A sketch follows below this item.)
    So to turn the columns entity_id | count
    into this: entity_id | count | count_statistics (last one: number of rows) or entity_id | count | statistics_true (last one: has at least one statistics row)
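
A sketch of one possible answer (untested; it assumes the schema used in this how-to, where statistics.metadata_id references statistics_meta.id and statistic_id values match entity_id values):

SELECT 
	s.entity_id, 
	COUNT(*) AS count,
	(SELECT COUNT(*) 
	 FROM statistics st
	 JOIN statistics_meta sm ON st.metadata_id = sm.id
	 WHERE sm.statistic_id = s.entity_id) AS count_statistics
FROM states s
GROUP BY s.entity_id
ORDER BY count DESC
LIMIT 90;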

  6. Error: Note that the FROM keyword is required in the count template - without it, this unfortunately does not work, at least not in SQLite Web:

SELECT count(*)
(
  -- add here query you want to count rows
);

What works instead (random example) is just adding another count column to an existing SELECT statement:

SELECT *, COUNT(*) as count
FROM "statistics"
WHERE "metadata_id"=500
ORDER BY created ASC;
  7. Question:
    Can you provide a SELECT and a DELETE FROM statement where I can only see/remove entries related to a specific entity (one which still exists and has entries in the states table)? (A sketch follows below this item.)
    (the IS NULL approach is not working here, obviously)

This would be the way to go for me (as mentioned in the beginning: I did not delete those entities, only disabled them - now I want to remove all that statistics history waste).
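
A sketch of this per-entity cleanup (the entity name sensor.my_disabled_entity is a hypothetical placeholder): first inspect what would be removed, then, with PRAGMA foreign_keys=ON so the cascade applies, delete the metadata row:

SELECT * FROM statistics_meta WHERE statistic_id = 'sensor.my_disabled_entity';

DELETE FROM statistics_meta WHERE statistic_id = 'sensor.my_disabled_entity';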

  8. Question: Overall: I think what would be great is a native option to achieve removing statistics from the database. Like for events and states with Recorder - Home Assistant. Exactly the same (to clean up single entities) AND allowing the use of wildcards for a full cleanup (like your how-to does).
    Are you interested in creating such a feature request for the recorder? Otherwise I would create one, just let me know.

  9. Optimization: People who want to track their orphaned statistics entities without going to the backend might want to use

SELECT COUNT (*) AS "Orphaned statistic entities"
FROM	(
		SELECT 
			b.entity_id, 
			a.statistic_id				
		FROM 
			statistics_meta a
		LEFT JOIN states b 
			ON a.statistic_id = b.entity_id
		GROUP BY 
			a.statistic_id -- we want to avoid duplicates
		)
WHERE "entity_id" IS NULL;

in an SQL integration sensor (SQL - Home Assistant). This way I can view the number in my custom admin dashboard and even get notified about new entries with a notification automation.

  10. Information: See this feature request for putting all that in the existing dev-tools statistics section GUI:
  11. Information: See this feature request for an overview of all database hygiene situations, current solutions and open feature requests:

@vinzcenzo any chance you might dig through my former (admittedly “a bit longer”) post to reply to the questions?

Thank you in advance. Hope to get some traction on this some day. Unfortunately, there does not seem to be a lot of activity, judging by the referenced links at Statistics - Fix Issue - large amount of data - #3 by vinzcenzo.

I have found myself in the situation where the database has retained long term stats for many physical sensors that I have now replaced with template sensors.

I replaced them with template sensors simply to round() the state so it changed less often and logged less often, and so I could remove all of the attribute data in which I had no historic interest. This saves me CPU and disk space.

Long term stats now tells me there is an issue with “This entity is no longer being recorded.” and even prompts me to fix it. The UI does not provide me with any actual way to do this, but instead informs me of what I did to cause the “issue”, and “implies” that the information I need to resolve it can be found in the documentation for the recorder integration. Presumably the only viable solution is to start recording the sensor again?

This is a bit pointless really - I have to trawl the forum to identify the right SQL query to run to non-destructively cascade-delete the “orphaned” entries.

I cannot see that finishing the job that was started would take long - it would provide great benefit to end users, a reduction in storage (which has been stated as a big goal for the dev team), and a slight performance boost (especially now that the statistics are being leveraged in the more-info history).
