Researching I/O Write Improvements

Hello,

After some discussion with @orange-assistant in the Home Assistant OS 8 release thread, I started researching I/O write improvements. Let’s discuss the results in this new thread.

The main goal is to find out whether ext4 commit intervals, tuned VM dirty writeback timeouts (vm.dirty_expire_centisecs) and potentially other tunables can have a positive effect on writes. The goal is essentially to cause fewer writes to the storage, which should lessen the write amplification effect typically seen on flash storage.
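
For context, these tunables live at the OS level. Here is a sketch of how they could be set on a generic Linux system (the device and mount point are placeholders; my assumption is that the “VM 30s/5s” notation in the tables below means dirty_expire/dirty_writeback, whose kernel defaults are 3000/500 centiseconds):

```sh
# ext4 journal commit interval in seconds, set as a mount option
# (device and mount point are placeholders):
mount -o remount,commit=30 /dev/sda1 /mnt/data

# VM dirty writeback tunables, in centiseconds (kernel defaults shown):
sysctl -w vm.dirty_expire_centisecs=3000    # age at which dirty pages must be flushed ("30s")
sysctl -w vm.dirty_writeback_centisecs=500  # flusher thread wakeup interval ("5s")
```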

For testing I use an iomonitor.sh script, which is an evolved version of the armbianmonitor command found here. My version only looks at the actual write I/Os, merges and bytes written to the main data disk, as reported by /proc/diskstats.

My test setup is a virtual machine with a vanilla HAOS 9.0 development build and a vanilla Home Assistant Core 2022.8.7 configuration with SQLite. No add-ons other than the Terminal add-on are installed. I did some idle measurements as a baseline, but those are really not that useful. Most tests run with a single automation which increments a counter every second. This changes the state of an entity every second, which essentially causes the recorder to record a state every second, so this should cause continuous writes.
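
For illustration, a sketch of such a test automation (the alias and entity names are hypothetical; it assumes a counter helper exists):

```yaml
automation:
  - alias: "1s increment"
    trigger:
      - platform: time_pattern
        seconds: "/1"                        # fire every second
    action:
      - service: counter.increment
        target:
          entity_id: counter.test_counter   # hypothetical counter helper
```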

Note: the HA Core recorder pools state changes by default, so SQLite is not causing a transaction every second.

The test is somewhat crude and mainly represents a simple HA installation. Add-ons (e.g. a different database backend) etc. could generate different results. But I think these tests should still be helpful to form a somewhat informed picture of the tunables.

| Test | I/O Count | I/O merge | I/O kbytes |
|---|---:|---:|---:|
| Idle, Commit 5s, VM 30s/5s (vanilla) | 90 | 41 | 453 |
| Idle, Commit 30s, VM 30s/5s | 50 | 42 | 274 |
| 1s Increment, Commit 5s, VM 30s/5s (vanilla) | 572 | 187 | 31139 |
| 1s Increment, Commit 30s, VM 30s/5s | 381 | 99 | 27316 |
| 1s Increment, Commit 60s, VM 30s/5s | 415 | 88 | 27702 |
| 1s Increment, Commit 60s, VM 60s/30s | 483 | 123 | 28215 |
| 1s Increment, Commit 120s, VM 120s/60s | 431 | 101 | 26994 |
| 1s Increment, Commit 600s, VM 30s/5s | 475 | 108 | 30257 |

What puzzles me a bit is that a higher commit interval caused fewer I/O merges. I guess the merging happens at a lower layer, so when fewer writes hit that layer there are also fewer opportunities for I/O merges… In any case, the main value we try to optimize is I/O Count, which is the number of effectively completed I/O operations on the block device.

I’ve looked into iotop, and all writes basically come from Home Assistant Core itself. I’d assume most of them are generated by the recorder.

So my interpretation of the results is the following: the difference is measurable, but not massive. As I wrote in the HAOS 8 release thread, higher levels do some write pooling already, which should translate to fewer writes even with a 5s commit interval, and this seems to be largely true.

However, we can definitely save some writes. We see about 30% fewer writes when increasing the commit interval to 30s. Beyond 30s there are really diminishing returns from increasing the value further. I’d guess that is because the recorder commits data at around this rate.

Thoughts, other ideas?

Best regards,
Stefan

2 Likes

That is already something, but is it also measurable to what extent the pages are “filled”?

With a mathematical approach one could argue that “before” (default 5s commit) pages were only filled 2/3, and now (30s commit) they are “efficiently” filled and fully written. In reality this will probably be far from true, and half-“full” pages will still get written even with a higher commit interval.

But in general, wearing out the flash cells 30% less with the same host/filesystem data written already sounds like an easy win. :+1:

Maybe it is also relevant that HA writes a lot of data? The more data (or the bigger the data “stream”) the host system writes, the faster a page (typical sizes could be 8 or 16k, or maybe even more nowadays) is filled. The question could be: how long does HA (for example a “vanilla” system) take to gather enough data for a full page (write)?

1 Like

Not really, just reading /proc/diskstats doesn’t allow this. It can be estimated by dividing the number of bytes written by the number of I/Os (e.g. for the vanilla 1s increment run: 31139 kbytes / 572 I/Os ≈ 54 kbytes per I/O on average). In practice the size of the I/O operations probably varies widely. When running the script, you can see smaller and bigger write bursts (this is from commit=600s):

Thu Sep  1 14:29:08 UTC 2022      61      20        1812 kbytes written after 16 sec
Thu Sep  1 14:29:09 UTC 2022       2       0           4 kbytes written after 1 sec
Thu Sep  1 14:29:13 UTC 2022      10       3          52 kbytes written after 4 sec
Thu Sep  1 14:29:17 UTC 2022       2       1           3 kbytes written after 4 sec
Thu Sep  1 14:29:23 UTC 2022       5       0        1200 kbytes written after 6 sec
Thu Sep  1 14:29:38 UTC 2022       4       0        1480 kbytes written after 15 sec
Thu Sep  1 14:29:45 UTC 2022       7       1         621 kbytes written after 7 sec
Thu Sep  1 14:29:54 UTC 2022      46      14        1212 kbytes written after 9 sec
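
For reference, a minimal sketch of the core loop of such a monitor (this is not the exact iomonitor.sh; the device name and sampling interval are assumptions):

```sh
#!/bin/sh
# Report write I/Os, merges and kbytes written per sampling interval.
DEV=sda        # assumption: main data disk
INTERVAL=10    # seconds between samples

read_stats() {
    # /proc/diskstats fields: $8 = writes completed, $9 = writes merged,
    # $10 = sectors written (512-byte sectors, so kbytes = sectors / 2)
    awk -v dev="$DEV" '$3 == dev { print $8, $9, $10 }' /proc/diskstats
}

prev=$(read_stats)
while sleep "$INTERVAL"; do
    cur=$(read_stats)
    set -- $prev $cur
    echo "$(date)  $(($4 - $1))  $(($5 - $2))  $((($6 - $3) / 2)) kbytes written"
    prev=$cur
done
```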

With a mathematical approach one could argue that “before” (default 5s commit) pages were only filled 2/3, and now (30s commit) they are “efficiently” filled and fully written.

I mean, we don’t know the underlying page size of the flash storage. The bytes written seem to be rather constant, and that does make sense, as Core runs SQLite in WAL mode (Write-Ahead Logging, which essentially causes data to be appended sequentially to the log; the same concept is used by F2FS as well).

But in general, wearing out the flash cells 30% less with the same host/filesystem data written already sounds like an easy win. :+1:

Sure, and I think we should take it :slight_smile: But it doesn’t seem to me that HAOS is completely misconfigured, as you claimed in the other thread :wink: In the end, the writes are caused by the application layer, which also controls when data is written out to disk (especially databases).

Maybe it is also relevant that HA writes a lot of data? The more data (or the bigger the data “stream”) the host system writes, the faster a page (typical sizes could be 8 or 16k, or maybe even more nowadays) is filled. The question could be: how long does HA (for example a “vanilla” system) take to gather enough data for a full page (write)?

30 MiB in 5 minutes does seem like quite a lot indeed. I’ll have to look into HA Core to see what exactly is done there.

1 Like

Actually, checking the recorder documentation, it seems that the default commit interval is set to 1s. With my 1 second automation, and with an ACID-compliant database, this should cause an I/O write every second (the Durability aspect of ACID basically means the database needs to write the data and issue an fsync to make sure it hits the disk). Yet we don’t see I/Os every second. How come?

The recorder does insert state changes into the states table and commits the transaction every second, I’ve checked that. However, the current database configuration sets SQLite’s synchronous flag to NORMAL (see Pragma statements supported by SQLite). This mode avoids some syncs (and loses the durability guarantee).

WAL mode is always consistent with synchronous=NORMAL, but WAL mode does lose durability. A transaction committed in WAL mode with synchronous=NORMAL might roll back following a power loss or system crash.

@bdraco made that change earlier this year with this PR, and IMHO it is a sensible choice for WAL.
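
For illustration, this is what it amounts to at the SQLite level (the recorder applies these pragmas internally; shown here via the sqlite3 CLI as a sketch):

```sh
sqlite3 home-assistant_v2.db <<'EOF'
PRAGMA journal_mode=WAL;    -- commits are appended to the write-ahead log
PRAGMA synchronous=NORMAL;  -- fsync at WAL checkpoints only, not at every commit
EOF
```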

Let’s see how various commit intervals behave at the disk I/O level:

| Test | I/O Count | I/O merge | I/O kbytes |
|---|---:|---:|---:|
| 1s Increment, recorder commit 1s, ext4 Commit 5s, VM 30s/5s* | 572 | 187 | 31139 |
| 1s Increment, recorder commit 1s, ext4 Commit 30s, VM 30s/5s | 381 | 99 | 27316 |
| 1s Increment, recorder commit 30s, ext4 Commit 30s, VM 30s/5s | 192 | 93 | 3975 |

(*) current default configuration

That is a 2x improvement in I/O writes, and a big difference in terms of bytes written!

From what I understand, SQLite’s write-ahead logging does not cause writes to collapse much, as the log is appended to after every transaction. Since we issue a database commit every second, the writes add up (data plus transaction overhead), and all these transactions eventually hit the disk.

| Test | I/O Count | I/O merge | I/O kbytes |
|---|---:|---:|---:|
| 1s Increment, recorder commit 0s, ext4 Commit 5s, VM 30s/5s§ | 3781 | 215 | 55286 |
| 1s Increment, recorder commit 1s, ext4 Commit 5s, VM 30s/5s* | 572 | 187 | 31139 |
| 1s Increment, recorder commit 5s, ext4 Commit 5s, VM 30s/5s | 373 | 153 | 11837 |
| 1s Increment, recorder commit 30s, ext4 Commit 5s, VM 30s/5s | 203 | 55 | 4492 |
| 1s Increment, recorder commit 0s, ext4 Commit 30s, VM 30s/5s§ | 3607 | 128 | 54330 |
| 1s Increment, recorder commit 1s, ext4 Commit 30s, VM 30s/5s | 381 | 99 | 27316 |
| 1s Increment, recorder commit 5s, ext4 Commit 30s, VM 30s/5s | 300 | 99 | 11051 |
| 1s Increment, recorder commit 30s, ext4 Commit 30s, VM 30s/5s | 192 | 93 | 3975 |

(*) current default configuration
(§) 0s causes the recorder to use synchronous=FULL. With an event every second, this causes a file system sync every second, which also nicely shows up during testing:

Fri Sep  2 09:12:20 UTC 2022      10       0         144 kbytes written after 1 sec
Fri Sep  2 09:12:21 UTC 2022      10       0         172 kbytes written after 1 sec
Fri Sep  2 09:12:22 UTC 2022      10       0         144 kbytes written after 1 sec
Fri Sep  2 09:12:23 UTC 2022      10       0         156 kbytes written after 1 sec
Fri Sep  2 09:12:24 UTC 2022      10       0         212 kbytes written after 1 sec

I did some more tests with logging enabled. Setting the log level to info logs every automation execution; in my test case this leads to two log entries every second. All these tests were done with the same 1s increment automation and the recorder commit interval set to 30s.

| Test | I/O Count | I/O merge | I/O kbytes |
|---|---:|---:|---:|
| log level warning, ext4 Commit 5s, VM 30s/5s | 203 | 55 | 4492 |
| log level info, ext4 Commit 5s, VM 30s/5s | 1830 | 357 | 12695 |
| log level info, ext4 Commit 30s, VM 30s/5s | 1842 | 130 | 11821 |

Most of these writes are generated by the system logger, systemd-journald. The ext4 commit interval does not really change the write I/Os, but the log shows that writes are done in bigger batches with an increased commit interval. If the blocks written happen to be on the same (flash) page, this can have a positive effect.

With all this testing, I come to the following conclusions:

  1. Make sure your system is not logging extensively. info-level logging can already cause lots of entries to be written.
  2. Set the recorder commit_interval to 5 or higher to decrease writes significantly (see the configuration sketch below).
  3. The file system tuning has some effect on write I/Os for the database, but not so much for logging. Commit intervals higher than 30s seem to have negligible effects.
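
A configuration.yaml sketch applying points 1 and 2 (the values are examples; both options are documented for HA Core):

```yaml
logger:
  default: warning       # avoid chatty info-level logging

recorder:
  commit_interval: 30    # seconds between database commits (default: 1)
```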

Even though the ext4 commit interval seems not to have a huge effect, bumping it a bit is probably still a good idea. I created a PR to set the commit interval to 30s, as this seems to have good returns while still being at a somewhat conservative level.

1 Like

Also one thing to note on the recorder: if there are no changes, we don’t commit. So if you have no state changes or events in a 1s interval period, we avoid the write.

1 Like

To close the loop on this research:

2 Likes

FYI

I’ve tried high values for commit_interval like 600, i.e. 10 minutes. I’ve logged the table row counts with a script and made the Excel chart below. It shows that the recorder commits approximately every 3-5 minutes. If this is normal, forget about my comment.

Statistics runs need the data committed before they can run, which means the every-5-minutes stats generation will always trigger a commit. A commit interval larger than 5 minutes makes no sense with the current design. We should probably cap that value so it’s apparent what the upper limit actually is.

Also, there are quite a few tasks (mostly stats) that will trigger a commit as well.

If a task isn’t set to commit_before = False, it has to trigger a commit (which only happens if there are pending writes).

1 Like

Hi @agners
The link for your monitoring script doesn’t seem to work. Could you please fix it? I’d like to do some similar monitoring for HA and it would help a lot. Thanks in advance!

I’ve updated the URL to point to a Gist!

Hi,
I just stumbled upon this thread while trying to get Home Assistant in my configuration to only update the database every 3 hours and keep everything in memory in between. So I set the commit interval accordingly, but I still find writes every 5 minutes.
Is this related to the statistics being written every 5 minutes, and can I change that behavior to reduce disk writes?
Background: I want to bring my NAS HDDs into standby, but this is not possible with HA writing every 5 minutes…

That is really not how memory management and commit intervals are meant to be used.

I suggest storing things locally. If you are worried about the data, you can take a backup every couple of hours.

Ah, forgot to mention that HA runs in a Docker environment on the NAS, so it does actually run locally :slight_smile: Guess that’s still not the recommended way to run it…