Home Assistant Yellow PCIe bus errors & reboots

I have an HA Yellow PoE with an SSD (Seagate ZP500GM3A023), and it frequently shows groups of errors followed by reboots like those below, up to several times an hour. I’ve tried removing the CM4 and SSD and re-seating them, which hasn’t helped. It should have plenty of PoE power available: it’s currently drawing 2.7W, and only 24W of the 40W budget is in use on that switch (Unifi Flex powered by the 60W power supply from the Flex Utility). Other PoE devices on that switch, including a 10W device, work fine. Any ideas on what to try to address this?

Edit: I tried both jumper positions for the PoE power class, with no difference.

Feb 06 13:42:12 hass-house kernel: pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
Feb 06 13:42:12 hass-house kernel: pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Feb 06 13:42:12 hass-house kernel: pcieport 0000:00:00.0:   device [14e4:2711] error status/mask=00001100/00002000
Feb 06 13:42:12 hass-house kernel: pcieport 0000:00:00.0:    [ 8] Rollover              
Feb 06 13:42:12 hass-house kernel: pcieport 0000:00:00.0:    [12] Timeout               
Feb 06 13:42:12 hass-house kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Feb 06 13:42:12 hass-house kernel: nvme 0000:01:00.0:   device [1bb1:5018] error status/mask=00001000/00006000
Feb 06 13:42:12 hass-house kernel: nvme 0000:01:00.0:    [12] Timeout               
Feb 06 13:45:11 hass-house kernel: pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
Feb 06 13:45:11 hass-house kernel: pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Feb 06 13:45:11 hass-house kernel: pcieport 0000:00:00.0:   device [14e4:2711] error status/mask=00001100/00002000
Feb 06 13:45:11 hass-house kernel: pcieport 0000:00:00.0:    [ 8] Rollover              
Feb 06 13:45:11 hass-house kernel: pcieport 0000:00:00.0:    [12] Timeout               
Feb 06 13:45:11 hass-house kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Feb 06 13:45:11 hass-house kernel: nvme 0000:01:00.0:   device [1bb1:5018] error status/mask=00001000/00006000
Feb 06 13:45:11 hass-house kernel: nvme 0000:01:00.0:    [12] Timeout               
Feb 06 13:46:12 hass-house kernel: pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
Feb 06 13:46:12 hass-house kernel: pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Feb 06 13:46:12 hass-house kernel: pcieport 0000:00:00.0:   device [14e4:2711] error status/mask=00001000/00002000
Feb 06 13:46:12 hass-house kernel: pcieport 0000:00:00.0:    [12] Timeout               
Feb 06 13:46:46 hass-house kernel: pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
Feb 06 13:46:46 hass-house kernel: pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Feb 06 13:46:46 hass-house kernel: pcieport 0000:00:00.0:   device [14e4:2711] error status/mask=00001100/00002000
Feb 06 13:46:46 hass-house kernel: pcieport 0000:00:00.0:    [ 8] Rollover              
Feb 06 13:46:46 hass-house kernel: pcieport 0000:00:00.0:    [12] Timeout               
-- Boot 57df44e29abf46ae87fb2548f40faf77 --
Apr 04 10:55:23 homeassistant kernel: Booting Linux on physical CPU 0x0000000000 [0x410fd083]
Apr 04 10:55:23 homeassistant kernel: Linux version 6.1.63-haos-raspi (builder@13ed6d6d8021) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot -g2d89a0f9) 11.4.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Tue Jan  9 10:42:51 UTC 2024
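
For anyone trying to read the "error status/mask" values in the log above: they are bitfields, and the bracketed numbers the kernel prints ([ 8], [12]) are the set bits. Here is a minimal sketch that decodes them. The bit labels are the short names the Linux kernel uses for correctable AER errors; the function name decode_aer is just something I made up for illustration.

```python
# Short labels for the correctable AER status bits, as printed by the
# Linux kernel. Bits not listed here are reserved.
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",      # replay number rollover
    12: "Timeout",      # replay timer timeout
    13: "NonFatalErr",  # advisory non-fatal error
    14: "CorrIntErr",   # corrected internal error
    15: "HeaderOF",     # header log overflow
}


def decode_aer(status_mask: str) -> dict:
    """Decode a 'status/mask' pair such as '00001100/00002000'.

    Hypothetical helper: returns the names of the bits set in each half.
    """
    status_hex, mask_hex = status_mask.split("/")
    status, mask = int(status_hex, 16), int(mask_hex, 16)

    def names(value: int) -> list:
        return [
            AER_CORRECTABLE_BITS.get(bit, f"bit{bit}")
            for bit in range(32)
            if (value >> bit) & 1
        ]

    return {"reported": names(status), "masked": names(mask)}


print(decode_aer("00001100/00002000"))
# {'reported': ['Rollover', 'Timeout'], 'masked': ['NonFatalErr']}
```

So both the root port and the NVMe are reporting replay timeouts at the data link layer, i.e. packets that had to be resent because they were not acknowledged in time.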

Update with what I’ve tried so far:

  • Put the Yellow in a different building, on a different switch (rack mount). No PCIe errors or reboots seen for a day.
  • In that same building, use a small portable switch (Unifi Lite 8). No issues for a day.
  • Moving the Yellow back to the original building, but powered by that Lite 8, the PCIe errors and reboots are back. So, it’s not power – it must be something environmental. Maybe the device is vulnerable to 2.4GHz or 5GHz RF?
  • In the original building, move the Yellow and Lite 8 to a different location away from the switches, AP, and cameras that it was next to originally. Since there isn’t cable running between the two locations (yet), the Lite 8 also has an AP (power set to low) on it, meshed to the other AP to provide connectivity. We’ll see what happens…
    • Still seeing PCIe errors and reboots.
  • Wrapping the Yellow in a few layers of aluminum foil, except for the Z-Wave and Bluetooth USB sticks, in case it’s some form of RF interference.
    • This seems to have worked. However, that implies that the Yellow is quite vulnerable to RF interference, which isn’t great. Do I have a defective board?
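
To compare locations with numbers rather than impressions, the error events can be counted per boot. This is only a rough sketch: it assumes the kernel log was saved as plain text in the same format as the excerpt above (including the "-- Boot ... --" separator lines), and the function name count_aer_events is hypothetical.

```python
import re
from collections import Counter

# Separator line printed between boots in the saved log.
BOOT_MARKER = re.compile(r"^-- Boot ([0-9a-f]+) --")
# One line of this kind starts each group of AER messages.
AER_EVENT = re.compile(r"AER: (?:Multiple )?Corrected error received")


def count_aer_events(lines):
    """Count AER corrected-error events, grouped by boot.

    Lines before the first boot marker are grouped under a placeholder key.
    """
    boot = "before first boot marker"
    counts = Counter()
    for line in lines:
        marker = BOOT_MARKER.match(line)
        if marker:
            boot = marker.group(1)[:8]  # short boot ID is enough to tell boots apart
            counts[boot] += 0           # record the boot even if it stays error-free
            continue
        if AER_EVENT.search(line):
            counts[boot] += 1
    return dict(counts)


sample = [
    "Feb 06 13:42:12 hass-house kernel: pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0",
    "Feb 06 13:46:12 hass-house kernel: pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0",
    "-- Boot 57df44e29abf46ae87fb2548f40faf77 --",
    "Apr 04 10:55:23 homeassistant kernel: Booting Linux on physical CPU 0x0000000000 [0x410fd083]",
]
print(count_aer_events(sample))
# {'before first boot marker': 2, '57df44e2': 0}
```

Running this over a day of logs from each location (and with and without the foil) would show whether the error rate really drops to zero or just gets lower.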

In general, Nabu Casa hired a third party to validate CE conformity. Yellow passed the immunity tests according to EN 55035:2017+A11:2020 (see also https://yellow.home-assistant.io/resources/Yellow_DoC_EU.pdf). However, from what I understand, the tests were run in the shipped configuration, which was without an NVMe SSD plugged in. Every NVMe SSD might behave differently, so this is a tricky situation anyway :man_shrugging: . From what I understand, from a (European) regulatory point of view, you’d have to retest any new combination of device + NVMe SSD before reselling :see_no_evil: In practice, though, it is usually the end user who combines the parts, and I am not sure what applies there.

A related anecdote: when we went for pre-compliance testing, we actually saw that some NVMes exceeded the radiation limits :see_no_evil: I don’t remember the brand; we had a couple with us and saw quite different behavior between them.

To my knowledge, we haven’t heard of immunity issues. Also, I could imagine that quite a few Yellows run in cabinets close to APs. Actually, my production system at home with an NVMe runs ~20cm from a Netgear WAX220 WiFi 6 AP, and I haven’t noticed any issue so far. This is with a Samsung SSD 970 EVO Plus 1TB.

That said, APs operate in the same spectrum as, or at multiples of, the frequencies a typical NVMe runs at. The PCIe bus itself is a diff-pair, which is quite robust against interference. But it could be that the AP messes with the NVMe directly. So your observations can indeed be explained by the proximity of the AP.

As mentioned above, technically, the combination needs to withstand interference. But for us, it is impossible to prove that for every combination. I guess a lot of PCs have a metal casing, so the typical operating environment of NVMes is much more shielded than it is in Yellow.

I wonder how a different brand would perform in this environment, e.g. the Samsung SSD 970 which seems to operate well in close proximity to my AP here.

This seems to have worked. However, that implies that the Yellow is quite vulnerable to RF interference, which isn’t great. Do I have a defective board?

If this is indeed radiated noise, it is very unlikely to cause harm. I think you probably just hit an unfavorable combination of circumstances in the end.

The aluminum foil sounds like a good workaround to me. In the end that makes Yellow almost like a full PC case now :sweat_smile:

Thanks for the additional detail. It hadn’t occurred to me that maybe it was the SSD that was responding poorly to noise, but it certainly could be the case. There’s still the confusing detail that in the building where everything worked fine, it was very close to an AP there too, but of a different brand (my ISP’s hardware, not my own Unifi deployment). Who knows…