PSA: 2024.7 recorder problems

I think I might have a db older than May 2024 (mine was created at least a year ago). But I doubt you mean creation date. Could you clarify please?

# sqlite3 home-assistant_v2.db
sqlite> select * from schema_changes;
1|33|2023-03-05 03:12:00.672507
2|34|2023-03-08 12:03:43.173219
3|35|2023-03-08 12:03:43.223264
4|36|2023-04-05 19:53:47.382529
5|37|2023-04-05 19:53:47.405016
6|38|2023-04-05 19:53:48.134406
7|39|2023-04-05 19:53:48.224035
8|40|2023-04-05 19:53:48.270669
9|41|2023-04-05 19:53:48.278902
10|42|2023-11-01 17:59:36.802397

Make sure the last row is 43 or higher. In the above example, it's only 42.
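
If you only want the latest version, a one-liner like this works too (the column names match the output above):

sqlite> SELECT schema_version FROM schema_changes ORDER BY change_id DESC LIMIT 1;
42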

gotcha, my db schema was at 44:

sqlite> select * from schema_changes;
...
30|42|2023-11-03 00:08:36.442936
31|43|2024-04-05 14:10:17.169585
32|44|2024-08-08 07:54:57.273545

FYI: just finished executing the statements on a 25 GB db. They took around 90 minutes to complete, mostly pegging one core at 100% for long stretches. After uploading it back to HAOS, HA was able to purge successfully again.

The INSERT should be done in batches of 10,000 or 100,000 rows. The DROP TABLE should be a rename, with the old table dropped only after the conversion is complete. How can I go about helping with this? I do database work but don't understand Git and such. Any pointers would be great.

The above is a quick solution for an offline migration. It is only a rough approximation of a 12-step SQLite migration, and does not exactly match the implementation HA uses internally.

I appreciate your help, but some of these suggestions don't apply to this use case:

  • Batching would not make sense in this case, as there are no row-level locks in SQLite, and doing it in batches would make it slower in most cases. That might make sense if this were MariaDB with InnoDB, but not here. Additionally, SQLite's own docs show the INSERT INTO new_X SELECT ... FROM X syntax (see the sketch after this list).
  • Dropping the temp table does not make sense either, as you would lose all the data: it has been renamed to become the new states table.
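
For reference, the rebuild pattern looks roughly like this (a compressed sketch of the 12-step procedure; the column list is illustrative, not HA's actual states schema, and these are not the exact statements from the PSA):

PRAGMA foreign_keys = OFF;

BEGIN TRANSACTION;

-- 1. Create the replacement table with the new definition.
CREATE TABLE new_states (
    state_id INTEGER PRIMARY KEY,
    entity_id TEXT,
    state TEXT,
    last_updated_ts REAL
);

-- 2. Copy everything across in a single statement; SQLite has no row-level
--    locks, so batching this would not help.
INSERT INTO new_states
    SELECT state_id, entity_id, state, last_updated_ts FROM states;

-- 3. Drop the old table and move the new one into place.
DROP TABLE states;
ALTER TABLE new_states RENAME TO states;

COMMIT;

PRAGMA foreign_keys = ON;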

2024.8.1 is shipping very soon. Don’t forget to free up disk space before updating.

https://www.home-assistant.io/integrations/recorder/#disk-space-requirements

My DB is currently 8.9 GB and I managed to free up 9.4 GB (one day I'd like to understand where all of the 32 GB of space allocated to the HA VM actually went…), but a manual purge after upgrading to 2024.8.1 still failed due to insufficient disk space:

Error executing query
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 924, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: database or disk is full

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/recorder.py", line 101, in session_scope
    yield session
  File "/usr/src/homeassistant/homeassistant/components/recorder/migration.py", line 2515, in rebuild_sqlite_table
    session.execute(text(f"DROP TABLE {orig_name}"))
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2351, in execute
    return self._execute_internal(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2245, in _execute_internal
    result = conn.execute(
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1418, in execute
    return meth(
           ^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
    return connection._execute_clauseelement(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
    ret = self._execute_context(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    return self._exec_single_context(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2353, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 924, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database or disk is full
[SQL: DROP TABLE states]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
-rw-r--r--    1 root     root     8989638656 Aug 10 19:41 home-assistant_v2.db
-rw-r--r--    1 root     root      13860864 Aug 10 19:41 home-assistant_v2.db-shm
-rw-r--r--    1 root     root     7124443136 Aug 10 19:41 home-assistant_v2.db-wal

I still however have 2+ GB of free space:

5f2e27d3aa9c:/data/homeassistant# df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  34.7G     30.6G      2.6G  92% /

To be clear, the shell is running in the hassio_supervisor container, connected via the Portainer addon.

After a reboot, I'm back to 9.3 GB of available space:

-rw-r--r--    1 root     root     8989978624 Aug 10 19:46 home-assistant_v2.db
-rw-r--r--    1 root     root         32768 Aug 10 19:47 home-assistant_v2.db-shm
-rw-r--r--    1 root     root          8272 Aug 10 19:47 home-assistant_v2.db-wal

Filesystem                Size      Used Available Use% Mounted on
overlay                  34.7G     24.0G      9.3G  72% /

Your database is contained in the 3 files you listed above: 16127942656 bytes combined

Which means you could need as much as 15.02GiB free.

There is likely some duplication in the WAL which will go away after a repack, so you probably don’t need exactly that much, but 9.4GB isn’t likely to be enough.

I've edited my post - the 15 GB combined size was a result of the purge attempt; after a reboot I'm back to the DB being (barely…) smaller than my available disk space.

If you have any pointers as to how to free up the rest of the 24GB I genuinely cannot find, I’ll be grateful. I have a different Linux system I can use to purge the DB offline but I honestly don’t know how to extract/import the DB from my HA instance (VM on Proxmox) without it actually running…

If the migration has completed successfully and you have enough free disk space, repack the database with the recorder.purge action (with repack turned on) to reduce its size. This will merge the WAL file into the main DB file and clean up some of the duplication.
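
For example, something like this from Developer Tools (a minimal sketch; keep_days is just an example value, and on versions before 2024.8 the first key is service: rather than action:):

action: recorder.purge
data:
  keep_days: 10   # example value only
  repack: true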

Cleaning up your disk usage beyond the database is outside this thread’s scope, and it would be better to start a new thread for that.

Will the migration try to run automatically after a reboot (once on 2024.8.1) or do I need to initiate it with the recorder.purge service action? Because that’s what I did and how I ended up with the error message.

EDIT: I see I’m (almost) out of space again so I assume it starts automatically.

Is there any way to tell if this is working? I ran a purge with repack enabled, and added 40 GB to my VM thinking I would be okay since the DB is only 19 GB, but it created a WAL file that is larger than the DB, and my storage is almost maxed out again. Did I not give it enough?

Or should I reboot, give more storage, and try again?

Edit: I left this going for about 2 hours, and now I'm down to 73% storage used. So I guess I cut it close, but it worked? The WAL file is still present. I'm guessing it will disappear on reboot?

Edit edit: it disappeared.

Final update from me: back to normal. I changed my purge_keep_days back to the default by commenting it out of my .yaml file; it had been set to 45 days. Now that 24 hours have ticked by and an auto purge ran, I'm back to normal, from 26 GB down to 4 GB.
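
For anyone searching later, this is the option I mean in the recorder section of the YAML config (10 days is the documented default; mine had been 45):

recorder:
  purge_keep_days: 10  # default is 10 days; commenting the line out falls back to this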

Enabling debug logging for recorder.migration will give you logs detailing the schema upgrade/migration process:

logger:
  logs:
    homeassistant.components.recorder.migration: debug

On my system the backup directory in homeassistant took up 20 GB; you might want to check it.

Another method is to run du -sh * in the root directory to see how much space each folder takes, and drill down from there.

This 2024.8.* core update ruined my entire weekend.

My issue must be #2 as outlined in the OP's edit. I wiped my system, started clean, and did a million things to try to resolve these stupid issues.

RPi 4, power over Ethernet, external SSD with 500 GB of space, running HAOS, thousands of entities, local video feeds over RTSP… and yes, a lot of custom integrations. It was running solid for well over a year without issue. Then, since upgrading, my host just randomly started becoming unresponsive multiple times per day, and the only way to get it back was pulling the Ethernet cord to reboot.

Insanely frustrating. It seemed like as long as I didn't touch anything in the UI, the host would stay alive for quite a while (at least a few hours). But once I started moving around the UI, pressing a button on one of my dashboards, etc., it was seconds or minutes before I got the stupid “Connection lost” toast and the whole process started over again. I went full “burn it down” mode and deleted my database files, stored images, filtered a ton of entities out of the recorder… anything I could think of to reduce load on the system and stop it from hanging. Nothing.

I ended up using the terminal to roll back to 2024.7.1 after fighting with it all weekend, and whaddya know — zero hangs.

Works completely fine now, though I had to go and fix the “Action” vs “Service” syntax across a bunch of automations.

I’m going to give it a lot of time before I open up this core update can of worms again. Maybe I’ll get forced into getting a proper NUC to handle the poor optimization.

Ugh. What a waste of a weekend.

2024.8 caused one automation to “go nuts” for me. Its trigger is “timer.cancelled” and I got an error that it was run 4096 times at once by this trigger (then the system crashed). This very automation had run perfectly for over 2 years. Now I've had to change the trigger…
Very strange…

This 2024.8 upgrade is such a shitshow. HELP :confounded:

My home-assistant_v2.db is 7.1 GB due to this bug. Today I started the upgrade and it seems I ran out of space at some point. I was looking for a way to re-trigger the upgrade process, but there doesn't seem to be one.

So I went back and restored my backup. Lost ten hours of data.

I saw the PSA on the disk space and generously added 20 GB, more than twice as much as should be needed. Now the upgrade has been running again for over 5 hours and home-assistant_v2.db-wal keeps growing out of bounds. It's already at 10 GB.

I'm getting errors about system load being too high. The logs are filled with timeouts, connection errors, etc.

I'm already deleting media, but I'm afraid I'll run out of space when there's nothing more to delete. And I can't add storage without a reboot.

Where can I see the status/progress of this database upgrade?

How do I know it has ended and was successful?

I don't use the internal backup at all; I run regular backups of the entire VM, plus snapshots before any major version upgrade. In the end I had to free up 14 GB of disk space for a 9 GB DB to successfully get back to the ~3.5 GB it was prior to the bug introduced in 2024.7.0.

To clarify:

Doing the inserts in batches with commits would save on lock space. I agree it would slow down smaller databases, but it would actually speed up larger ones due to processing smaller commits. A more sophisticated approach would be to count the rows and, based on a threshold, do it in one pass or in batches.
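
Roughly the batched pattern I have in mind (purely illustrative: the new_states name, rowid ranges, and batch size are made up, not HA's actual migration code):

-- one batch per transaction; repeat with the next rowid range until every row is copied
BEGIN TRANSACTION;
INSERT INTO new_states SELECT * FROM states WHERE rowid > 0 AND rowid <= 100000;
COMMIT;

BEGIN TRANSACTION;
INSERT INTO new_states SELECT * FROM states WHERE rowid > 100000 AND rowid <= 200000;
COMMIT;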

I should have been clearer. I would not drop the states table until the end. By renaming it, I would keep it around until I was certain the new table was complete. It is just a pattern that I follow.

I have always thought I could not help with Home Assistant development. That is because I work mostly in back ends with tools like SQL. If my skills can be of any help, I am happy to jump in.