I have been getting very strange crashes of my whole HA Debian server (supervised install). They leave no trace of any error and happen about every other week. Can you suggest a way to analyze this? Maybe it is a faulty motherboard or memory. I have run some hardware diagnostics but found no errors. When the crash happens, the system is completely frozen and the processor temperature rises from a steady 42 °C by at least 20 degrees. The logs show absolutely nothing. Here is part of the syslog:
Mar 24 12:24:15 orava 2b1e7eabb187[584]: 12:24:15:623 ZCL attribute report 0xD0CF5EFFFEF7C70D for cluster: 0x0006, ep: 0x01, frame control: 0x08, mfcode: 0x0000
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ [long run of NUL bytes (^@) trimmed]
Mar 24 13:07:32 orava kernel: [ 0.000000] microcode: microcode updated early to revision 0x28, date = 2019-11-12
12:24 = crash, 13:07 = hard reboot by pressing the button.
Any ideas how to approach this?
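One way to pin down each crash window from a syslog like the one above is to look for the runs of NUL bytes, which typically appear when the machine stops before buffered log data reaches the disk. The following is only a minimal sketch: the function name `find_crash_gaps` and the 16-byte threshold are assumptions made for illustration, and it expects the classic "Mar 24 12:24:15" timestamp format shown in the excerpt.

```python
import re
import sys

# Classic syslog timestamp at the start of a line, e.g. "Mar 24 12:24:15".
TIMESTAMP = re.compile(rb"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}", re.MULTILINE)


def find_crash_gaps(data, min_nuls=16):
    """Return (last_stamp_before, first_stamp_after) for every NUL run.

    min_nuls is a hypothetical threshold: shorter runs are ignored.
    """
    # A run of NULs, optionally continuing across line breaks.
    nul_run = re.compile(rb"\x00{%d,}[\x00\s]*" % min_nuls)
    chunks = nul_run.split(data)
    gaps = []
    for before, after in zip(chunks, chunks[1:]):
        stamps_before = TIMESTAMP.findall(before)
        stamps_after = TIMESTAMP.findall(after)
        last = stamps_before[-1].decode("ascii") if stamps_before else None
        first = stamps_after[0].decode("ascii") if stamps_after else None
        gaps.append((last, first))
    return gaps


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_gaps.py /var/log/syslog
    with open(sys.argv[1], "rb") as f:
        for last, first in find_crash_gaps(f.read()):
            print(f"last entry before gap: {last}, first entry after: {first}")
```

Comparing the "last entry before gap" times across several crashes may show whether they cluster around a particular time of day or a particular log message.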
That rise in temperature suggests a tight loop, CPU busy doing something.
Some questions, if you don’t mind…
Are you running headless? During a hang, is it possible to switch on the screen and open a terminal window, or can you connect with SSH or VNC?
Maybe after running for some time, the journal and all those logs etc. on tmpfs mounts (in RAM) deplete your memory, and you run into problems.
Did you disable swap space by any chance?
Do you use “log2ram”?
Can you check your available memory and swap usage, e.g. with free? Perhaps you could do this regularly over time (e.g. hourly) and append the output to a file, to see whether there are any trends that suggest a memory leak of some kind.
Is there any correlation with some scheduled or regular activity that takes place every other week, or around the time it happens? Backups, a database purge, or even something more distant like WiFi issues (or is your RPi connected via LAN)?
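The periodic memory check suggested above could look roughly like this. It is a sketch under assumptions: it reads /proc/meminfo directly rather than calling free, the log file name memlog.txt is hypothetical, and the function names are made up for illustration. Each sample is flushed and fsynced so the last line written before a hard hang has a better chance of surviving on disk.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def format_sample(info, now=None):
    """Build one log line with a timestamp, available memory and swap in use."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    available = info.get("MemAvailable", 0)
    return f"{stamp} MemAvailable={available}kB SwapUsed={swap_used}kB"


def append_sample(path="memlog.txt"):
    """Append one sample to the (hypothetical) log file and force it to disk."""
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    with open(path, "a") as out:
        out.write(format_sample(info) + "\n")
        out.flush()
        os.fsync(out.fileno())
```

Calling append_sample() once an hour from a scheduled job would build up a trend; a steadily falling MemAvailable or growing SwapUsed in the lines before a crash would point towards memory exhaustion.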
I noticed that and was impressed by @smoltron's machine naming.
Back on topic, I’m facing a similar situation and have been pulling my hair out for the past couple of months.
My last uptime was 3 weeks before it crashed, so I took the following measures to investigate:
Ran memtest to rule out a memory issue (in my case memtest failed even though the RAM is healthy, which pointed me towards a hardware/BIOS issue)
Installed the kernel crash dump package linux-crashdump, which should dump any info into /var/crash in the event of further hangs
Removed all peripherals. I had been using a DVB-S2 PCIe card with third-party drivers, so that was the first to go
My problem appears related to the BIOS and potentially C-state handling / microcode, so I downgraded the BIOS to a pre-Spectre/Meltdown version
Disabled all Spectre mitigations via GRUB (to rule out microcode shenanigans)
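After changing the mitigation settings it can be worth confirming what the running kernel actually reports. Linux exposes one file per known CPU vulnerability under /sys/devices/system/cpu/vulnerabilities, each containing a short status such as "Vulnerable", "Not affected" or "Mitigation: ...". The helper below is a small sketch; the function name is an assumption made for illustration.

```python
from pathlib import Path


def read_mitigation_status(base="/sys/devices/system/cpu/vulnerabilities"):
    """Return {vulnerability_name: status_text} as reported by the kernel."""
    status = {}
    root = Path(base)
    if not root.is_dir():
        # Older kernels or non-Linux systems do not have this directory.
        return status
    for entry in sorted(root.iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


if __name__ == "__main__":
    for name, text in read_mitigation_status().items():
        print(f"{name}: {text}")
```

If the entries still show "Mitigation: ..." after a reboot, the boot parameter probably did not take effect.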
Yes, this is headless. After crash the computer is fully unusable. No sign of life until a hard reset.
I have swap and 4 GB of memory. That should suffice. I do not use log2ram.
Here is the output of “free”:
              total        used        free      shared  buff/cache   available
Mem:        3910988     1540276      944972       23088     1425740     2228888
Swap:        999420           0      999420
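As a rough check of those numbers, the snippet below (an illustrative sketch, with a made-up function name) works out the share of memory still available and the swap in use from the snapshot. It comes to about 57% available with no swap used, so at that moment memory was not under pressure, although a single snapshot cannot rule out a slow leak.

```python
SNAPSHOT = """\
total used free shared buff/cache available
Mem: 3910988 1540276 944972 23088 1425740 2228888
Swap: 999420 0 999420
"""


def summarize_free(text):
    """Return (available_fraction, swap_used_kib) from `free` output."""
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if sep:  # the header line has no colon and is skipped
            rows[label.strip()] = [int(x) for x in rest.split()]
    mem, swap = rows["Mem"], rows["Swap"]
    # Mem columns: total, used, free, shared, buff/cache, available
    available_fraction = mem[5] / mem[0]
    # Swap columns: total, used, free
    return available_fraction, swap[1]


fraction, swap_used = summarize_free(SNAPSHOT)
print(f"available: {fraction:.1%}, swap used: {swap_used} KiB")
```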
There is no scheduled activity that can be connected to this. My i3 Debian is connected by wire to LAN.
Orava is of course “squirrel” in English. It is the name of the computer.
Here is the hardware:
Please keep us up to date on your progress, and thank you for the instructions. I will try something similar over the weekend. My BIOS is dated 2016, so maybe there are no Spectre changes in it? Do you think disabling C-states could help?
Yep, I wouldn’t change your BIOS just yet, as it predates Spectre.
Not sure about C-states. When I downgraded the BIOS I restored all the settings, including the C-states, but despite the same settings my CPU frequency scaling changed, as per the graph. I’m hoping this is the cause of my woes. Will keep you posted.
I fixed this problem by replacing components one at a time. First, I replaced the memory. No effect: the system still crashed about once a week without any error message. Second, I replaced the WD SSD with a Samsung 870 EVO. That seems to have helped. The system has now been running for a month without any problems.