NVMe stability problems on RPi 5

I would recommend trying to move the RPi to a different location, different outlet. Adding to my previous experience, after moving mine to a quadruple wall outlet (with only 3 devices connected) it was working without issues. A couple days ago I connected a 4th device and the RPi started showing the same errors.

Maybe it’s not the same issue in your case, but it might be worth testing such an easy fix.

I run HomeAssistant on a qemu VM with 10GB memory and 8 cores allocated to it. It ran fine for months, but then I had a power loss event at home and upgraded HomeAssistant to the October release, and now I have issues too. Unfortunately I’m not sure whether it’s the power loss or the upgrade that caused the issues.

Update: I just ran dmesg on my host and, unfortunately, I see that BTRFS detected some corruption a few hours ago. It appears unrelated to the HAOS upgrade and the power loss. It’s a 6-7 year old SSD. Guess that’s that.

Same here. I bought a Pi 5 8 GB with a PCIe M.2 NVMe set and a Patriot P320 128GB SSD.
Initially I had lots of stability issues and crashes, with async page faults on erofs.
What seems to work for me is adding the following parameters to /mnt/boot/config.txt on HAOS:

# specific for nvme
dtparam=pciex1
dtparam=pciex1_gen=3
dtparam=pciex1_aspm=off
dtparam=pciex1_msi=off
dtoverlay=cma,cma-96

As suggested by Gemini.
Gen2 should be safer but when I change _gen back from 3 to 2, problems reappear.
System is stable so far, but let’s see what happens after a week or so…
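
To see whether a `pciex1_gen` change actually took effect, the negotiated link speed can be read back from sysfs. The following is only a minimal sketch, assuming a Linux host with Python 3 and the standard `current_link_speed` attribute under `/sys/bus/pci/devices`; the function names are made up for illustration, and on a HAOS shell without Python the same files can simply be read with `cat`.

```python
import glob
import re

# Per-lane transfer rate (GT/s) of each PCIe generation relevant to the Pi 5.
GT_PER_GEN = {2.5: 1, 5.0: 2, 8.0: 3}


def pcie_gen_from_speed(text):
    """Map a sysfs link-speed string such as '8.0 GT/s PCIe' to a generation number."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", text)
    if not match:
        return None  # e.g. 'Unknown' when no link is up
    return GT_PER_GEN.get(float(match.group(1)))


def report_links(pattern="/sys/bus/pci/devices/*/current_link_speed"):
    """Print the negotiated speed of every PCIe device that exposes one."""
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                speed = handle.read().strip()
        except OSError:
            continue  # attribute not readable for this device
        device = path.split("/")[-2]
        print(f"{device}: {speed} (Gen {pcie_gen_from_speed(speed)})")


if __name__ == "__main__":
    report_links()
```

If the NVMe device still reports 5.0 GT/s after setting `dtparam=pciex1_gen=3` and rebooting, the setting was not applied or the link fell back to Gen 2.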

Hi there, I also had a Geekworm X1001 setup, with different NVMe chipsets that were also said to be certified. It always caused erratic system crashes.

I finally replaced the Geekworm HAT and NVMe with the “original” Raspberry HAT+ and an “original” Raspberry NVMe. Since then, no problems.
However, HAOS never gave me a hint in any log about what the reason for the crashes was. I run the Raspi remotely via SSH or the web interface.

Hi,

I think I have the same problem.

I have been using a GeekPi Starter Kit (RPi5 8GB) including their SSD HAT, together with a Samsung 980 Pro 1 TB, since 07/2024. Weird crashes started very early last year, right after the initial setup. I could never find out what the issue really was. Since the frequency was very low (maybe once a month) and a reboot solved everything, I did not bother about it…

When the problem occurred, the frontend continued working but some functions just failed, like changing configs, showing plugins etc. Restarting via the FE also failed because it complained that configuration.yaml was missing. However, all my Zigbee stuff (using a Sonoff Zigbee 3.0 USB Dongle Plus) continued working. In the end I often just had to hard reset the Pi.

Last week HA failed again in the same way but did not come up again even after several boots. This was new. So I connected a display to it for the very first time and saw all the I/O errors.

I disassembled my RPi 5 housing, cleaned and reconnected all the ribbon cables from the HAT, and tested it for a couple of hours. It worked perfectly, but only on my desk without the metal housing.

I thought the problem was gone, put everything back in the housing and the RPi back to its original location. And guess what, the issue was right back … it did not even boot up.

Frustrated, I bought a new SSD (Kingston 500GB) and reinstalled everything today from a backup. The system has now been running for 1-2 hours, but after reading all your posts I doubt that this will fix the issue.

I will report any updates here … fingers crossed

Two days later. The RPi crashed overnight. Again, NVMe issues. This time they are permanent and HA won’t boot.

I guess I will get rid of this HAT. The housing and HAT have good reviews on Amazon, but there are some 1 and 2 star reviews complaining about NVMe incompatibilities.

Update from my side too. I wasn’t able to get rid of the issue either and ended up moving to a mini PC (same price as my RPi setup). It now runs HAOS on Proxmox flawlessly.

Uh… that’s sad…

What HAT were you using with your Pi?

I have this case: GeeekPi Argon Neo 5 M.2 NVME PCIE Case for Raspberry Pi 5, Raspberry Pi 5 Aluminum Case with Built-in Fan (No Raspberry Pi and NVMe SSD) : Amazon.es: Computers

It’s actually quite good, but seems to be not enough. If I were to test further I would try with a bigger power supply. My hypothesis is that the official one is not enough for the ssd, the fan and the zigbee dongle. But maybe I’m wrong.

This could be possible.
Maybe the power supply cannot deliver the current spikes/surges the SSD requires. A better designed HAT might buffer these using capacitors.

But I am only guessing. I ordered the HAT+ and the official RPi SSD today. Another 100€, but if it fixes my issues I am fine…

Please let us know if it works!

My setup has now been running for over a week without any problems (using the original HAT+ and their NVMe + original PSU). This is not real proof, but my other setup (regardless of which NVMe I used) had problems after 2 days. I will give you another update in a couple of days.

The sad thing is that the money I invested (85€ for another NVMe, because I thought my first one was broken, plus 110€ for the original HAT+ and NVMe, so almost 200€) would have been enough to buy a mini PC, which I guess is a lot more stable than this setup.

Yes, exactly my thoughts when I switched to the MiniPC… Thanks for your feedback though, we are learning something here.

FWIW I had a similar issue and whilst I cannot say for certain it is solved, I have not had the issue for roughly a week.
I use a WD Black SN770M, which I recently realised is “not supported”. Currently it seems to be OK. I used to have the issue every 48 hours or so, and I had set up a monitoring daemon on my Frigate server to power cycle the HA box via a Tapo smart switch, but it seems I no longer need that.

What I did:

  • Made sure I was on latest firmware (I was)
  • Updated /mnt/boot/config.txt to what @MichielfromNL suggested, i.e.
# specific for nvme
dtparam=pciex1
dtparam=pciex1_gen=3
dtparam=pciex1_aspm=off
dtparam=pciex1_msi=off
dtoverlay=cma,cma-96
  • And I purchased a new power supply, an Aukey PA-C5

Not sure if only one would have sufficed but it seems to work for me now. Might even try to power my Frigate server from the Aukey (OrangePi 5)
I have both Zigbee and Z-Wave connected via USB to my HA RPi 5.

Hi,

I ran into similar issues with a Patriot P300 256 GB SSD. After reading this interesting thread I came away with the feeling that it is a Gen2/Gen3 compatibility issue, for 2 reasons:

  • Most NVMe SSDs nowadays are Gen3, except the Raspberry foundation one
  • Several of you reported that the problem stopped after moving to the Raspberry foundation NVMe SSD

I decided to update the boot config to define I wanted to use Gen3 rather than Gen2 (the default) and added only:

dtparam=pciex1_gen=3

Since then my Raspberry Pi 5 has been running like a charm: no issue for 3 days (before, it was a matter of hours), with rather intensive use of the Raspberry to configure Home Assistant.

Personally I’d advise staying minimalistic in the boot config changes and not setting parameters to non-default values (through dtparam) unless it is needed. Also note that defining both pciex1 and pciex1_gen is not necessary as the latter overrides the former.

A possible power adapter issue was mentioned, but AFAIK it doesn’t really make sense in a small configuration (if you don’t have many devices connected to the Pi): 27W should be more than enough for the Pi 5 and an SSD…

NVMe, Argon, SSD, M.2, Raspberry Pi are all trigger words that invoke system and disk drive instability and incompatibility concerns.

I think the original underlying issue was a firmware incompatibility in the controller, and so a lot of Argon cases with inbuilt HATs were unloaded at special prices to unsuspecting hobbyists.

The underlying issue is hardware incompatibility, often misdiagnosed or in combination with power supply issues, extremely hard to diagnose from system logs as the fault is at the hardware level, and system logs offer no clues as the system locks up, hard, before anything is written.

There have been a number of people that have done comprehensive tests, and good/bad combination lists are available if you are aware and research this issue online.

The problem is still out there. It is often misdiagnosed and then bypassed by replacing one of the incompatible items of hardware, so the problem is resolved with the innocent user none the wiser and slightly poorer for the exercise.

The real solution is heightened awareness, bad publicity and a product recall.

My setup is now running for four weeks without any issues.

Original raspi HAT and NVMe…

This is a very interesting thread!

I’m running HAOS 15.2 on an RPi 5 8GB installed in an Argon Neo NVMe case with the original RPi 27W power brick.

On HAOS 15.2 I have extremely reliable and stable performance, yet ever since 16.0 it has been nothing but trouble. Reading this thread, it seems that many have or have had the same issues I see when upgrading to 16.1, 16.2, 16.3 and 17.0, although I have not tried any of the ideas mentioned here.

Nevertheless I fail to understand the correlation: if everything is smooth in 15.2, what changed in 16.0 and up that would cause this instability?

For me the first thing was that HAOS 16 and 17 kept crashing. I wiped the SSD clean, installed HA fresh and restored from backup. That seemed to be better, but then HA started doing random restarts. That was not desirable either, as my espresso machine is turned on by an automation in the morning and just at that time Home Assistant was doing another auto restart.

Repairing my sqlite3 DB, and losing much of my LT data in the process, did not resolve the issue. So I’ll try to have a look at the config.txt tomorrow.

In any case, great that I found this thread.

Do you have any logs from your RPI? Usually it will complain about nvme issues. In my case it was hundreds or thousands of lines spamming the logs. I was only able to see them after connecting a display to the RPI since some messages already appeared during boot time.
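
If the kernel log can be saved to a file (for example with `dmesg > dmesg.txt` on a system where that is possible), the NVMe-related lines can be pulled out of the noise. This is just a rough sketch: the match pattern is my own assumption (any line mentioning `nvme` or `I/O error`) and the function name is made up, so adjust it to what your log actually shows.

```python
import re
import sys

# Assumed filter: lines that mention the nvme driver/device, or a generic I/O error.
NVME_PATTERN = re.compile(r"nvme|I/O error", re.IGNORECASE)


def nvme_related(lines):
    """Return only the log lines that look NVMe or I/O-error related."""
    return [line.rstrip("\n") for line in lines if NVME_PATTERN.search(line)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 nvme_filter.py dmesg.txt
    with open(sys.argv[1], errors="replace") as handle:
        hits = nvme_related(handle)
    for line in hits:
        print(line)
    print(f"{len(hits)} matching line(s)")
```

A count that keeps growing between two saved logs points to an ongoing storage problem rather than a one-off event at boot.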

I had the first issues right after my first setup, mid 2024. However, for some reason these effects became massive or even permanent by the end of 2025. I am not sure if this correlates with specific HAOS versions. As soon as I found out it is most likely a HW issue, I stopped trying to understand the exact reason and switched HW. I never had any issues after that… I guess there are just some incompatible HW/SW combinations out there. As others already reported, the PCIe generation seems to make a difference.

Let us know what helped and what not. I am keeping my fingers crossed for your espresso machine!

I have not been as lucky as @dan-shaqfu: I continue to struggle with RPi 5 instabilities due to I/O errors on the NVMe. My configuration is:

  • a RPI 5 + official 27W Power Supply
  • Geekworm X1001 HAT
  • Patriot P320 256 GB NVMe SSD (M.2 2280)

As I said in a previous post, using Gen 3 rather than Gen 2 (dtparam=pciex1_gen=3) helps but is not enough to get a stable system. The issue continues to happen every few days to a week, but that is an improvement over Gen 2, where it happened every 30 minutes to an hour…

I also decided to add dtparam=pciex1_aspm=off, which disables Active State Power Management at the bus level. It may result in slightly increased power consumption but should only help… It is not clear in my case whether there is a real impact…

I decided not to go with dtparam=pciex1_msi=off (which moves back to legacy/traditional interrupts) as it should not have any impact and I try to remain minimalistic in my changes. I also didn’t add dtoverlay=cma,cma-96, as CMA (Contiguous Memory Allocator) is not used by Home Assistant and in particular not by the NVMe SSD. BTW, cma-96 increases the CMA buffer from the default 64 MB to 96 MB, which is useless if the buffer is not used. It can be checked with:

cat /proc/meminfo | grep -i cma
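
The same check can be scripted if you prefer numbers in MiB. A minimal sketch, assuming the usual `CmaTotal` and `CmaFree` lines in `/proc/meminfo` (reported in kB); the function name is hypothetical:

```python
import re


def cma_info(meminfo_text):
    """Extract CmaTotal and CmaFree (in kB) from the text of /proc/meminfo."""
    found = {}
    for line in meminfo_text.splitlines():
        match = re.match(r"(CmaTotal|CmaFree):\s+(\d+)\s*kB", line)
        if match:
            found[match.group(1)] = int(match.group(2))
    return found


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            info = cma_info(handle.read())
    except OSError:
        info = {}  # not a Linux system, or /proc not mounted
    for name, kib in info.items():
        print(f"{name}: {kib} kB ({kib / 1024:.0f} MiB)")
```

If CmaFree stays close to CmaTotal while the system is under load, the CMA pool is barely used and enlarging it is unlikely to change anything.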

I contacted Patriot support and received quite detailed technical answers, which is good, even if it has not yet helped me reach a stable situation. In the discussion, they raised that the problem may be connected to transient situations when the SSD moves between “power states” (used to save energy): the transition involves some latency that may be interpreted as an I/O error (device not responding). One critical kernel/driver parameter to control this seems to be nvme_core.default_ps_max_latency_us. According to ChatGPT (the only detailed source I found about this parameter), it sets how long you are willing to wait for a power state transition (when the SSD has to wake up) or, said differently, how deep a sleep the NVMe is allowed to enter. The current value can be checked with:

cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

In my case (Home Assistant 2026.3, which means HAOS 17 if I’m right), the value is 100000 (one hundred thousand), which is a huge value. ChatGPT suggests that 5000 is already a very big value. Lowering the value means that the NVMe will enter less deep sleep states, and setting it to 0 disables power state transitions. I may explore changing this parameter, but I am reluctant to change too many parameters at a time, so I will wait for the next occurrence of the problem… I’m interested in comments about exploring this path.
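
For anyone who wants to try it: as a module parameter of nvme_core, this value is normally set on the kernel command line, which must stay a single line. Below is a small sketch of adding it without duplicating it. It assumes (untested on my side) that on HAOS the command line lives in /mnt/boot/cmdline.txt next to config.txt; the helper name and the example string are made up, and the script only prints the result, it does not write any file.

```python
PARAM = "nvme_core.default_ps_max_latency_us"


def set_cmdline_param(cmdline, key, value):
    """Return the single-line kernel cmdline with key=value present exactly once."""
    # Drop any earlier occurrence of the key so repeated runs stay idempotent.
    tokens = [
        token
        for token in cmdline.split()
        if token != key and not token.startswith(key + "=")
    ]
    tokens.append(f"{key}={value}")
    return " ".join(tokens)


if __name__ == "__main__":
    example = "console=tty1 rootwait"  # placeholder, not a real HAOS cmdline
    print(set_cmdline_param(example, PARAM, 0))
```

After editing the file by hand and rebooting, the value read back from /sys/module/nvme_core/parameters/default_ps_max_latency_us should show the new setting.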

Another potential issue mentioned by Patriot support is a quality problem in the HAT or the FFC cable. I have the feeling that the X1001 is a good alternative to the official HAT (which does not support M.2 2280 if I’m right), with generally good reviews, but that is probably not enough to conclude anything… I double-checked the FFC cable insertion in the connectors, so I’d tend to exclude a problem at this level. Possibly a problem introduced by the HAT (some signal reflection on the lines, for example) could explain a greater sensitivity to deep sleeps.