Very strange crash of HA Debian - faulty hardware?

smoltron · March 24, 2021, 1:35pm

I have been getting very strange crashes of my whole HA Debian server (supervised install). They leave no trace of any error and happen about every other week. Can you suggest a way to analyze this? Maybe it is a faulty motherboard or memory. I have run some hardware diagnostics but found no errors. When the crash happens, the system is completely frozen, and after the crash the processor temperature rises from a steady 42 °C by at least 20 degrees. The logs show absolutely nothing. Here is part of the syslog:

Mar 24 12:24:15 orava 2b1e7eabb187[584]: 12:24:15:623 ZCL attribute report 0xD0CF5EFFFEF7C70D for cluster: 0x0006, ep: 0x01, frame control: 0x08, mfcode: 0x0000
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ [long run of ^@ (NUL) characters shortened]
Mar 24 13:07:32 orava kernel: [    0.000000] microcode: microcode updated early to revision 0x28, date = 2019-11-12

12:24 = crash; 13:07 = hard reboot by pressing the button.
Any ideas on how to approach this?
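A run of NUL (^@) characters like the one in the excerpt above usually marks the point where the log was cut off by an unclean stop, so it can be used to find crash times in older logs as well. The following is only a minimal sketch, assuming the classic "Mon DD HH:MM:SS" syslog timestamp format and the default Debian path /var/log/syslog; the function names are made up for illustration.

```python
import re

# Classic syslog timestamp prefix, e.g. "Mar 24 12:24:15" (assumed format).
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}")


def find_nul_gaps(lines):
    """Return (last_line_before, first_line_after) for each run of NUL bytes."""
    gaps = []
    last_good = None   # most recent timestamped entry seen so far
    in_gap = False     # True once a NUL run has been seen after last_good
    for raw in lines:
        # A boot message can be glued directly onto a NUL run, so split on it.
        pieces = re.split(r"\x00+", raw.rstrip("\n"))
        for index, piece in enumerate(pieces):
            if index > 0:
                in_gap = True  # a NUL run separated this piece from the previous one
            if TIMESTAMP.match(piece):
                if in_gap:
                    gaps.append((last_good, piece))
                    in_gap = False
                last_good = piece
    return gaps


def scan_syslog(path="/var/log/syslog"):
    """Scan a syslog file; errors='replace' tolerates undecodable bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_nul_gaps(handle)
```

Calling scan_syslog() (reading the file normally needs root or membership in the adm group) returns one pair per gap: the last entry written before the hang and the first entry of the next boot.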

JoFie · March 24, 2021, 2:02pm

That rise in temperature suggests a tight loop, with the CPU busy doing something.
Some questions, if you don’t mind…

Are you running headless? During the hang, is it possible to switch on the screen and open a terminal window, or can you connect with SSH or VNC?
Maybe after running for some time, the journal and all those logs etc. on tmpfs drives (in RAM) deplete your memory, and you run into problems.
- Did you perhaps disable swap space?
- Do you use “log2ram”?
Can you check your available memory and swap usage, using e.g. free? Perhaps you can do this regularly over time (e.g. hourly) and append the output to a file, to see if there are any trends that suggest a memory leak of some kind.
Is there any correlation with some (scheduled/regular) activity that takes place every other week, or around the time it happens? Backups, a database purge, or even something more distant like WiFi issues (or is your RPi connected via LAN?)
What is “orava kernel”?
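The suggestion to record memory and swap usage over time could be sketched roughly as below. This is an illustration, not a tested tool: it reads /proc/meminfo (the same source that free uses) instead of parsing the output of free, and the function names and the choice of fields are assumptions made for the example.

```python
import os
import time

# Fields to record, all in kB as reported by /proc/meminfo (illustrative choice).
FIELDS = ("MemAvailable", "MemFree", "SwapTotal", "SwapFree")


def parse_meminfo(text):
    """Turn /proc/meminfo content into a {field: value} dictionary."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def format_sample(info, now=None):
    """Build one CSV row: local timestamp followed by the selected fields."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return ",".join([stamp] + [str(info.get(field, "")) for field in FIELDS])


def log_sample(log_path, meminfo_path="/proc/meminfo"):
    """Append one sample to log_path; intended to be run hourly, e.g. from cron."""
    with open(meminfo_path) as source:
        row = format_sample(parse_meminfo(source.read()))
    with open(log_path, "a") as log:
        log.write(row + "\n")
        log.flush()
        os.fsync(log.fileno())  # push the row to disk so it survives a hard hang
    return row
```

Running log_sample("/home/kari/memlog.csv") once an hour (the path is only an example) builds up a file in which a steadily shrinking MemAvailable or SwapFree column would point to a leak.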

_dev_null · March 24, 2021, 2:16pm

I noticed that and was impressed by @smoltron’s machine naming

Back on topic, I’m facing a similar situation and have been pulling my hair out for the past couple of months.
My last uptime was 3 weeks before it crashed, so I took the following measures to investigate:

- Ran memtest to rule out a memory issue (in my case memtest failed even though the RAM is healthy, which pointed me in the direction of a hardware/BIOS issue)
- Installed the kernel crash dump package linux-crashdump, which should dump information into /var/crash in the event of further hangs
- Removed all peripherals; I had been using a DVB-S2 PCIe card with third-party drivers, so that was the first to go
- My problem appears to be related to the BIOS and potentially C-state handling / microcode, so I downgraded the BIOS to a version from the pre-Spectre/Meltdown days
- Disabled all Spectre mitigations via GRUB (to rule out microcode shenanigans):

GRUB_CMDLINE_LINUX_DEFAULT="mitigations=off"
#GRUB_CMDLINE_LINUX_DEFAULT="dis_ucode_ldr mitigations=off"

Cross your fingers and toes - I’m at 1 week uptime and need to beat the last uptime record of 3 weeks
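A change to GRUB_CMDLINE_LINUX_DEFAULT only takes effect after running update-grub and rebooting, so it is worth confirming that the running kernel really picked it up. Here is a minimal sketch that inspects /proc/cmdline and the kernel's own per-vulnerability status files; the function names are invented for this example.

```python
import os


def kernel_params(cmdline):
    """Split a kernel command line into a {name: value} dictionary."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None  # bare flags map to None
    return params


def mitigations_disabled(cmdline):
    """True if the command line contains mitigations=off."""
    return kernel_params(cmdline).get("mitigations") == "off"


def booted_cmdline(path="/proc/cmdline"):
    """Command line of the kernel that is running right now."""
    with open(path) as handle:
        return handle.read().strip()


def vulnerability_status(directory="/sys/devices/system/cpu/vulnerabilities"):
    """Map each vulnerability name to the status text reported by the kernel."""
    status = {}
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as handle:
            status[name] = handle.read().strip()
    return status
```

After the reboot, mitigations_disabled(booted_cmdline()) should return True, and vulnerability_status() shows what the kernel reports for each individual issue.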

smoltron · March 24, 2021, 3:17pm

Yes, definitely, but what?

Yes, this is headless. After the crash the computer is completely unusable. No sign of life until a hard reset.

I have swap and 4 GB of memory. That should suffice. I do not use log2ram.
Here is the output of “free”:

              total        used        free      shared  buff/cache   available
Mem:        3910988     1540276      944972       23088     1425740     2228888
Swap:        999420           0      999420

There is no scheduled activity that can be connected to this. My i3 Debian machine is wired to the LAN.
Orava is of course “squirrel” in English. It is the name of the computer.
Here is the hardware:

kari@orava:~$ sudo inxi -v 2
System:    Host: orava Kernel: 4.19.0-14-amd64 x86_64 bits: 64 Console: tty 0 Distro: Debian GNU/Linux 10 (buster) 
Machine:   Type: Desktop System: ASUS product: All Series v: N/A serial: N/A 
           Mobo: ASUSTeK model: H81T v: Rev X.0x serial: 140728184800342 UEFI: American Megatrends v: 0807 date: 04/29/2016 
CPU:       Dual Core: Intel Core i3-4130T type: MT MCP speed: 799 MHz min/max: 800/2900 MHz 
Graphics:  Device-1: Intel 4th Generation Core Processor Family Integrated Graphics driver: i915 v: kernel 
           Display: server: No display server data found. Headless machine? tty: 170x45 
           Message: Unable to show advanced data. Required tool glxinfo missing. 
Network:   Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet driver: r8169 
Drives:    Local Storage: total: 111.79 GiB used: 25.23 GiB (22.6%) 
Info:      Processes: 234 Uptime: 3h 50m Memory: 3.73 GiB used: 1.59 GiB (42.6%) Init: systemd runlevel: 5 Shell: bash 
           inxi: 3.0.32

Please keep us up to date on your progress. And thank you for the instructions. I will try something similar over the weekend. My BIOS is dated 2016, so maybe there are no Spectre changes in it? Do you think disabling C states could help?

_dev_null · March 24, 2021, 4:19pm

Yep, I wouldn’t change your BIOS just yet, as it predates Spectre.

Not sure about C states. When I downgraded the BIOS I restored all the settings, including the C states, but despite the same settings my CPU frequency scaling changed, as per the graph. I’m hoping this is the cause of my woes. Will keep you posted.

smoltron · June 22, 2021, 7:12am

I fixed this problem by replacing components one at a time. First, I replaced the memory. No effect: the system still crashed about once a week without any error message. Second, I replaced the WD SSD with a Samsung 870 EVO. That seems to have helped. The system has now been running for a month without any problems.

_dev_null · March 12, 2022, 1:19am

Yep, mine was the same: removing a PCIe satellite card fixed my random lockups.

It could be the card, but more likely the unsupported driver.

We live and learn