Help HA Died twice now

Calzor_Suzay · October 24, 2023, 7:10am

Long term user of HA without too much of an issue, but recently (within the last week) it has died twice overnight, to the extent that I can no longer ping the box and have to yank the power out to restore normality.

The first time, once I had regained access to the web UI, I had to restart it from the GUI a few times to get all functionality back, e.g. MQTT and Lightwave via the RFXCOM RFXtrx integration.

This morning it was a plain unplug/replug and after a while it all came up but obviously during this time no lights work, automations are down and generally I’m stranded.

What do I need to do/look at to try and diagnose the issue? The logs under Settings > System > Logs just seem to start from boot time. Are there deeper/older logs I can look through?

I haven’t looked at the console aka the TV that’s plugged into it when it has died so far as it’s usually really early and the household needs ‘stuff’ working aka fix it fix it fix it!

Nothing has fundamentally changed of late, just the usual updates to HA & Core etc.
It runs on a Raspberry Pi 4 with an SSD attached (Can I health check this from HA?)
Is there verbose logging I can activate to write to the SSD that I can then retrieve after?

Calzor_Suzay · October 24, 2023, 7:20am

BTW home-assistant.log.1 only seems to show timeline from the reboot and home-assistant.log.fault is empty.

Calzor_Suzay · October 24, 2023, 7:35am

So I found this great post about the journal logs which are the proper logs…
How to get to your log after restart/restore - Community Guides - Home Assistant Community (home-assistant.io)

Currently trying to ‘get’ or read the logs.

Calzor_Suzay · October 24, 2023, 7:57am

Ok I’m stuck…

If I use ha host logs I get logs from 06:19 to now which due to BST is probably when I restarted it.
If I use ha host logs --boot -1 I get logs from 00:09 to 02:19 with no obvious errors.

How do I get 02:19 to 06:19 or is that the problem as it died?
Is my SSD dying hence no logs written?

starob · October 24, 2023, 11:41am

You may not be affected, but I had similar problems with USB UASP (USB Attached SCSI Protocol) mode. Some combinations of USB adapters and UASP do not work reliably. It can work for a while, but then sporadic crashes appear. USB driver changes in the Linux kernel might also be a reason why this suddenly starts to cause issues.

I disabled UASP to trade SSD transfer speeds for stability. This fixed it. I followed this guide but it’s in German, sorry. https://www.youtube.com/watch?v=fMlIV6kNzaA&t=586s

Calzor_Suzay · October 24, 2023, 12:13pm

Would that explain the missing four hours of logs?
To be honest I’ve been using the SSD for years with no issues, and didn’t even realise there were so many issues until your info led me to this thread: USB Boot on Raspberry Pi 4 - Installation / Home Assistant OS - Home Assistant Community (home-assistant.io)

Time to google some more…

starob · October 24, 2023, 12:28pm

Well, it crashes and there is nothing useful in the logs. Since you used your SSD for so long without problems, it’s rather unlikely that you are affected by this. In my case I had crashes after a few weeks.

Calzor_Suzay · October 24, 2023, 12:40pm

It’s not so much that there’s nothing useful in the logs; I think they are actually missing… I’m not 100% sure, so if someone can clarify what I posted above, please do.

If they are just missing, there’s a chance the drive has this issue and I’ve just never encountered it that often.
Over time I’ve had random days where the system is unresponsive and I’ve had to pull the plug in the same scenario, but it’s been weeks/months apart and I just thought it was a bug. With it happening twice within one week it piqued my interest, or more accurately raised the shout-at-me factor for stuff not working, which got me investigating.

I found this great article on the subject but I don’t seem to have a /boot folder to edit the cmdline.txt file. My German is non-existent so I can’t follow the other video, but I’ll try to work it out from the pictures.

I’ve also found this add-on called Scrutiny which displays the SMART info of the drive in case it’s just literally on the way out hassio-addons/scrutiny at master · alexbelgium/hassio-addons (github.com)

There’s always the chance it’s nothing to do with the drive and I’m barking up the wrong tree, but with the lack of logs (or what appears to be a lack) I’m a bit stuck.

Calzor_Suzay · October 24, 2023, 1:13pm

How did you edit your cmdline.txt?

I have no /boot folder in the root and /mnt is empty whether I use Terminal inside HA or via SSH.

starob · October 24, 2023, 1:16pm

I did not use HAOS at that time but had a Raspbian OS/Docker installation, so I had a boot directory. Actually I don’t know how this would be done, if at all possible, under HAOS because, as you said, there is no /boot directory.

See below.

KruseLuds · October 25, 2023, 12:44am

You are actually lucky to have made it so far. I have a RPI4 (8Gig of RAM) running supervised on a USB SSD (Samsung T7 1TB). It works great except almost every 7 days it just dies. I mean - the CPU decides to go on vacation. No high temperatures, CPU at 0%, memory usage very low, nothing in ANY of the logs except the last line in each log is some kind of binary gobbledygook - so any command line sensors that, for example, run a tail on the syslog would not be able to properly interpret what is in the file. I got sick of rebooting and then truncating the log files by opening them with “sudo nano” and then using alt-t to clear out the contents before doing ctrl-x to close out each file as empty. (I know I sound like a Linux noob because I am to some extent).

Anyway, it is not talked about often, but I have heard from other very senior people on this forum, evidently this is a known issue with the RPIs in general with HA but an ‘open secret’ (?). I believe this has never been resolved because there are no logs to show what the problem was - there are no error messages in ANY log files even if you have extremely verbose logging for all of them turned on (believe me, I have gone down that road as well, there is nothing to be found). It is just running normally - and then decides it is bored and wants to go on strike, so it spits some garbage obscenities onto the log files and goes on vacation (still powered up, but ‘out to lunch’ so to speak). No CPU race conditions, thread locks, hardware issues or anything like that is shown… For this reason alone others have decided against using an RPI but I did this workaround which resolved the issue permanently in an elegant manner - and it can be set/adjusted from the dashboard with no further coding needed once it is set up.

The workaround is a scheduled host reboot. As I do it twice a week I could duplicate the below twice to show you everything I have, but that is not needed to get my point across, so I’ll just show the first. The day of the week selection is just an input_select with every day of the week available in the selection:

and the time is an input_datetime set to only pay attention to the time:
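The helper definitions themselves are not shown in the post. As a rough sketch only, two helpers with the entity IDs used by the automation could be declared in configuration.yaml roughly like this (they can equally be created from the UI). The friendly names are made up for illustration, and the three-letter option values assume that `now().strftime('%a')` returns English abbreviations such as `Mon`:

```yaml
# Sketch only: helper names match the entity IDs referenced in the automation;
# the "name" values are illustrative, not taken from the original post.
input_select:
  day_of_weekly_reboot_1_of_2:
    name: Day of weekly reboot (1 of 2)
    # Options must match the output of now().strftime('%a')
    # (assumed here to be English three-letter abbreviations).
    options:
      - Mon
      - Tue
      - Wed
      - Thu
      - Fri
      - Sat
      - Sun

input_datetime:
  time_for_weekly_reboot_1_of_2:
    name: Time of weekly reboot (1 of 2)
    has_date: false   # ignore the date part
    has_time: true    # only the time of day is used by the trigger
```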

Here is the automation utilizing the above - it is triggered every day at the time specified and the condition is only met if the three letter representation of the day of the week matches what is selected -

alias: "Host Reboot (Per Day and Time #1 of 2)"
description: ""
trigger:
  - platform: time
    at: input_datetime.time_for_weekly_reboot_1_of_2
condition:
  - condition: template
    value_template: >-
      {{ states('input_select.day_of_weekly_reboot_1_of_2') ==
      now().strftime('%a') }}
action:
  - service: hassio.host_reboot
    data: {}
mode: single

This has worked flawlessly ever since I put it into place. The “hassio.host_reboot” service is the safest way to do it, as it first gracefully shuts down HA and then gracefully shuts down all the other processes running on the RPI, including unrelated software, properly closing all open files such as log files, and then reboots it (so the reboot takes a few minutes). You just have to have everything that is running on your RPI set to start on bootup.

I have never had the weird problem since.

Yes, I was thinking of using the calendar for this (I know some people would ask), as they also do have a ‘repeating’ feature - but when I set this up I did not know about the calendar (I have used it since for other items though). I am one who goes back and redoes things when I learn they could be done better (as it is a hobby/obsession as well), but in this case the calendar is not better… I would rather keep it this way - this is simpler, you can just update the day and time with a couple of GUI items on the dashboard even on your cell phone easily, when you are nowhere near a keyboard, once it is built, no reason to bother with any coding to adjust it… I guess you could also add input_select/s with “Enabled” and “Disabled” values which are then shown on the dashboard that you can then use in the automation code as a condition to not bother with the reboot if set as “Disabled”… (I do that with a bunch of other stuff, it is convenient…)
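For the “Enabled”/“Disabled” idea, a minimal sketch of what the condition block of the automation might look like with the extra check added. The helper `input_select.weekly_reboot_1_of_2_enabled` is a hypothetical name, assumed to be an input_select with the two options “Enabled” and “Disabled”:

```yaml
# Sketch only: conditions in this list are ANDed, so the reboot runs only
# when the day matches AND the (hypothetical) switch helper says "Enabled".
condition:
  - condition: template
    value_template: >-
      {{ states('input_select.day_of_weekly_reboot_1_of_2') ==
      now().strftime('%a') }}
  - condition: state
    entity_id: input_select.weekly_reboot_1_of_2_enabled  # hypothetical helper
    state: "Enabled"
```

An input_boolean toggle would work just as well for this; the input_select version simply follows the approach described above.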

Hope that helps

starob · October 26, 2023, 6:02am

Sorry, but this is wrong; I mixed it up with another issue I had in the past. If you are using HAOS you need to enable root SSH access to the host OS. From a terminal add-on you do not have access to /boot, since you are confined to the Docker container.

Anyway here is how you can find out if you are affected by the USB driver UASP issue:
Go to Settings -> System -> Logs, then open the logs for Supervisor (top right in the UI). If the log is displayed almost immediately then you are NOT affected. If it feels like it’s taking forever then you ARE affected.

Calzor_Suzay · October 26, 2023, 6:53am

I tried and it opened pretty much straight away.

starob · October 26, 2023, 7:21am

Other potential causes are:

Power supply
heat

Do you have another Power supply you can test with?
You can install the System Monitor Integration to track the cpu temperature.
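Once a CPU temperature sensor exists, an automation can flag sustained high temperatures so there is a record before the next hang. This is only a sketch: the entity ID `sensor.processor_temperature` and the 75 °C threshold are assumptions, so check the actual entity name on your system and pick a limit that suits your setup.

```yaml
# Sketch only: entity ID and threshold are assumptions, not from the thread.
alias: "CPU temperature warning"
description: ""
trigger:
  - platform: numeric_state
    entity_id: sensor.processor_temperature  # assumed entity ID
    above: 75                                # illustrative threshold in °C
    for:
      minutes: 5                             # ignore short spikes
action:
  - service: persistent_notification.create
    data:
      title: "CPU running hot"
      message: "CPU temperature has been above 75 °C for 5 minutes."
mode: single
```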

Calzor_Suzay · October 26, 2023, 2:32pm

I doubt it’s either, to be honest. The PSU is only a few months old, though I don’t have a spare, and I doubt it’s heat, but I get where you’re going.

I’ve decided to take a totally different tack which will hopefully alleviate the problem.
I was looking at adding network cameras and using Frigate, Coral etc., so I have bought a fairly cheap NUC (NUC7i7DNKE, i7-8650U, 16GB of RAM) and will look to port the whole lot over. I’m hoping a backup/restore can do this.

Thanks both for your help, it was very useful.