NFS backup has started to fail with partial writes

Running HAOS with Core 2025.11.3 and Supervisor 2025.12.2 on an RPi4.

My backups to NFS have been failing for the last few days, and I’m trying to debug it. What’s happening is that partial files are being written to the NFS-mounted drive.

I’ve rebooted my HA server a couple of times over the last day.

I can mount the NFS share from my Mac laptop without any problem and, for example, copy a full backup file (1.8 GB) from the NFS server to my laptop over Wi-Fi.

Back on the HA machine, I can attach to the supervisor container and see the contents of the NFS share:

➜  ~ docker exec -it hassio_supervisor bash
842cbd8c22fd:/# mount | grep nfs
macha.local:/home/ha/ha_nfs_backup on /data/mounts/Macha_Backup type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=200,retrans=2,sec=sys,clientaddr=192.168.0.142,local_lock=none,addr=192.168.0.253)

842cbd8c22fd:/# ls /data/mounts/Macha_Backup
Automatic_backup_2025.11.3_2025-11-28_05.31_17004318.tar
...

Then if I start a copy to the NFS share, it runs for a while and then hangs:


842cbd8c22fd:/data/backup# cp Automatic_backup_2025.11.3_2025-12-05_05.23_08004568.tar /data/mounts/Macha_Backup
^C
^C

And on the NFS server, the transfer isn’t complete (only 18 MB of the file made it):

ha@macha:~/ha_nfs_backup$ ls -lt | head -5
total 10084016
-rw-r--r-- 1 nobody nogroup   18874368 Dec  5 12:42 Automatic_backup_2025.11.3_2025-12-05_05.23_08004568.tar
-rw-r--r-- 1 nobody nogroup  872415232 Dec  3 06:22 Automatic_backup_2025.11.3_2025-12-03_05.08_42004201.tar
-rw-r--r-- 1 nobody nogroup 1937653760 Dec  2 05:36 Automatic_backup_2025.11.3_2025-12-02_05.30_37004745.tar
-rw-r--r-- 1 nobody nogroup 1939107840 Dec  1 05:19 Automatic_backup_2025.11.3_2025-12-01_05.14_23005396.tar

And if I attach again to the supervisor container, I can see cp in the process list, but I’m unable to kill it, even with kill -9:

842cbd8c22fd:/# ps | grep cp
  318 root      0:06 [cp]
  423 root      0:00 grep cp

842cbd8c22fd:/# kill -9 318

842cbd8c22fd:/# ps | grep cp
  318 root      0:06 [cp]
  425 root      0:00 grep cp
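kill -9 has no effect because the process is in uninterruptible sleep (the state:D in the dmesg trace further down), where signals are not delivered until the I/O completes or times out. A quick check from inside the container: field 3 of /proc/&lt;pid&gt;/stat is the state letter. Using the shell’s own pid here as a runnable stand-in for 318:

```shell
# State letter for a process: "D" means uninterruptible sleep (usually
# blocked on I/O), which not even SIGKILL can interrupt.
PID=$$   # substitute 318, the stuck cp's pid, on the real system
awk '{print $3}' "/proc/$PID/stat"
```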

Sadly, nothing in the host or supervisor logs.

Oh, dmesg on the HAOS host does show something perhaps related.

[ 9546.832738] INFO: task cp:13992 blocked for more than 120 seconds.
[ 9546.832831]       Tainted: G         C         6.12.47-haos-raspi #1
[ 9546.832876] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9546.832922] task:cp              state:D stack:0     pid:13992 tgid:13992 ppid:13838  flags:0x0000020d
[ 9546.832948] Call trace:
...

I’m not seeing any hardware issue in dmesg (e.g. like SSD errors).

Any suggestions on how to debug further?
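One experiment on my list is a bounded synthetic write, to take cp and the backup files themselves out of the picture. This is only a sketch: on the real system DEST would be /data/mounts/Macha_Backup; /tmp is a stand-in so the snippet runs anywhere, and timeout caps the attempt (though a D-state hang would still ignore the signal):

```shell
# Sketch: write a fixed amount with dd; conv=fsync forces the data out
# to the server before dd exits, so a stall shows up here rather than
# in some later flush. DEST is a stand-in for the NFS mount point.
DEST=/tmp
timeout 60 dd if=/dev/zero of="$DEST/nfs_test.bin" bs=1M count=16 conv=fsync \
  && echo "write ok"
rm -f "$DEST/nfs_test.bin"
```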

My drive:

# parted -l
Model: SanDisk Extreme SSD (scsi)
Disk /dev/sda: 1000GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: pmbr_boot

Number  Start   End     Size    File system  Name              Flags
 1      1049kB  34.6MB  33.6MB  fat16        hassos-boot       msftres
 2      34.6MB  59.8MB  25.2MB               hassos-kernel0
 3      59.8MB  328MB   268MB                hassos-system0
 4      328MB   353MB   25.2MB               hassos-kernel1
 5      353MB   622MB   268MB                hassos-system1
 6      622MB   630MB   8389kB               hassos-bootstate
 7      630MB   731MB   101MB   ext4         hassos-overlay
 8      731MB   1000GB  999GB   ext4         hassos-data

I’m having the same issue. I spent the better part of half a day trying to figure out what was happening, trying all sorts of workarounds. I found that turning off backup encryption for the NFS location allows the backup to complete. I also create backups locally, and was able to leave encryption on for that location.

Interesting. I already had encryption turned off for my NFS backup. It’s something odd in HAOS, since I can copy a backup to my laptop, then connect to the NFS drive from the laptop and copy it onto the NFS mount with no problem.

I’ll have to try again. I’ve reduced my backup size lately (I had been making backups of the DB into the config directory, and those added to the backup size).

I just couldn’t see any errors in any of the logs, but need to try again.

I just started to look at this again.

There are two issues I’m looking at:

The first one I’m ignoring for now: the HA “Repair” feature isn’t reconnecting the share, but if I reconfigure the NFS backup then HA connects. That’s for later.

The second problem I’m seeing is that HA is (at times[1]) using IPv6 to connect, and the NFS server is rejecting that.

Note that it’s including the IPv6 scope (%eth0) in the match:

2026-01-10T09:08:27.452013-08:00 macha rpc.mountd[17501]: refused mount request from fe80::35ae:e8b8:6425:a2b%eth0 for /home/bill/ha_nfs_backup (/home/bill/ha_nfs_backup): unmatched host

HAOS does advertise its addresses via mDNS, and the NFS server can resolve them:

$ avahi-resolve -4 -n homeassistant.local && avahi-resolve -6 -n homeassistant.local
homeassistant.local	192.168.0.70
homeassistant.local	fe80::35ae:e8b8:6425:a2b

$ ping -6 fe80::35ae:e8b8:6425:a2b
PING fe80::35ae:e8b8:6425:a2b (fe80::35ae:e8b8:6425:a2b) 56 data bytes
64 bytes from fe80::35ae:e8b8:6425:a2b%eth0: icmp_seq=1 ttl=64 time=0.367 ms

So, adding that address (scope and all) to /etc/exports seems to have addressed it:
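For reference, the /etc/exports lines behind the exportfs -s output below look roughly like this (options trimmed to the essentials; the scoped link-local address is the one from the mountd log above):

```
/home/bill/ha_nfs_backup  homeassistant.local(rw,sync,root_squash,no_subtree_check)
/home/bill/ha_nfs_backup  fe80::35ae:e8b8:6425:a2b%eth0(rw,sync,root_squash,no_subtree_check)
```

followed by `sudo exportfs -ra` to reload the table.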

$ sudo exportfs -s
/home/bill/ha_nfs_backup  homeassistant.local(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,root_squash,no_all_squash)
/home/bill/ha_nfs_backup  fe80::35ae:e8b8:6425:a2b%eth0(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,root_squash,no_all_squash)

[1] I say “using IPv6 at times” because mounting the NFS share from the HAOS command line succeeds about half the time. I have to assume the successes are when it happens to use IPv4.

What I’d like to understand is why HAOS seemingly at random attempts IPv6 or IPv4, and why, when it failed over IPv6, it didn’t then fall back to IPv4. Can any network experts explain what’s happening there?
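My rough understanding (hedged, as I haven’t traced mount.nfs itself): the name is resolved via getaddrinfo(), which can return the AAAA and A records in either order depending on the resolver, and the caller may only try the first address rather than falling back on failure. A quick way to see what the resolver hands back, using localhost as a runnable stand-in for macha.local:

```python
import socket

# getaddrinfo() may return IPv6 (AF_INET6) and IPv4 (AF_INET) results
# in resolver-dependent order; a caller that only tries the first
# address will "randomly" pick a family. "localhost" is a stand-in --
# on my network this would be macha.local, port 2049 (NFS).
for family, _, _, _, sockaddr in socket.getaddrinfo("localhost", 2049,
                                                    proto=socket.IPPROTO_TCP):
    print(family.name, sockaddr[0])
```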

Next: why something similar is happening with IPv6 when trying to rsync from HA to another machine…

Have to wonder about the resolver:

debug1: OpenSSH_10.0p2, OpenSSL 3.5.4 30 Sep 2025
debug1: Reading configuration data /config/zwave_backup/ssh_config
debug1: /config/zwave_backup/ssh_config line 1: Applying options for gate
ssh: Could not resolve hostname gatepi3.local: Name has no usable address
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(232) [Receiver=3.4.1]
Exited with 255 at Sat Jan 10 10:14:53 PST 2026

➜  zwave_backup ping gatepi3.local
PING gatepi3.local (192.168.0.51): 56 data bytes
64 bytes from 192.168.0.51: seq=0 ttl=64 time=6.831 ms
64 bytes from 192.168.0.51: seq=1 ttl=64 time=7.020 ms
^C

Looks like I’ll be hard-coding the IP address.
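Rather than hard-coding the address into the rsync command itself, it can go into the same ssh_config the transfer already reads. The gate alias and the 192.168.0.51 address are from the output above; AddressFamily inet is a standard OpenSSH option that additionally pins ssh to IPv4:

```
Host gate
    HostName 192.168.0.51
    AddressFamily inet
```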