HA stops working every Monday

I had this band-aid running for 4 years before HA was stable… Sometimes the juice isn’t worth the squeeze. If it’s a hardware issue that can’t be located (or takes 2 years to be located), what would you rather have? A band-aid where you don’t have to do anything, or a pain in your ass every Monday morning when you wake up?

True, true, but if we don’t start, we’ll never know. If it turns out to be a hardware issue, we can still move on, at least knowing we tried.

It wasn’t meant as criticism; a band-aid is better than nothing, that’s for sure. But right now some more information wouldn’t be too much to ask, don’t you think? :slight_smile: At least it will show how serious people are about the issue they have. :smiley: :smiley: :smiley:

@pedro @paddy0174 funny you mention this: I restart my Home Assistant each night at 03:23, but on Sunday night the automated reboot does not happen due to the freeze.
And yes, I am very interested in working with you guys to get to the bottom of this. As you can see, I posted my configuration earlier; the only thing missing was my power supply, 5V 2A.

So, my hardware consists of a Raspberry Pi 3B (V1.2), a zzh (CC2652R) Zigbee stick and a 32 GB micro SD card.
Integrations:
Agent DVR, 2x FRITZ!Box (router), HACS, Meteorologisk institutt, Mobile App, MQTT (Mosquitto broker), Philips Hue, Raspberry Pi Power Supply Checker, IKEA, Tuya
HACS integrations:
Afvalwijzer (waste scheduling), SONOFF LAN, FRITZ!Box Tools, Yahoo Finance.
Add-ons:
Check HA config, DuckDNS, File editor, Mosquitto broker, Samba share, Terminal & SSH, WireGuard, Zigbee2MQTT.

I installed the standard Home Assistant Operating System through Raspberry Pi - Home Assistant (home-assistant.io). Now on version core-2021.3.3
Router: Fritz!Box 7590
Philips Hue Bridge and motion sensor
Ikea bridge and two lightbulbs
4 Aqara motion detectors (added after freeze)
4 Aqara temperature sensors (added after freeze)
3-5 Sonoff Minis
3-5 WiFi wall plugs (Tuya)
2 smoke alarms (added after freeze)
Cat is not allowed to come near the Pi… Neither is the dog… Not the season for spiders

I had this idea too. But it makes no difference whether HA runs for a whole week or whether it is restarted several times. The crash inevitably came in the night from Sunday to Monday at 1:00 am.

what’s your timezone?

GMT +1 (Amsterdam)

Hmm, looking at the logbook (my bad, I should have done that before my earlier remark),
last night my reboot automation did kick in at 02:23. The last entry is Home Assistant stopped at 02:23:03, then nothing until the reboot this morning.
Looking at the previous week, the automated reboot did run at 02:23 and the entry Home Assistant stopped at 02:23:03 was logged, but not the last several @home mentions.
The last entry in the log is at 02:28 from my mobile phone, then nothing again.

I’m in GMT timezone and mine still crashes just after 01:00 GMT.

Thanks @paddy0174 and @petro for stepping in!

  • what system are you running? (HassOS, python venv, supervised install) In short, how did you install HA?

HassOS - core-2021.3.4 - supervisor-2021.03.6 - Home Assistant OS 5.12. Downloaded from the site around May 2019 and written to the SD card with balenaEtcher. Very recently I did a re-download and re-install on a brand new SD card and used a recent snapshot to get going again. It didn’t solve my issue, as written in this thread.

  • what is the exact hardware you are running? Pi4 with 2GB? Pi 3B+? Please don’t forget to mention all things apart from the standard, eg. are you using an SSD or the microSD?

Raspberry Pi 3B+; Kingston 16GB Class 4 micro SD. The replacement one I used recently was identical and brand new (I bought multiple of these in May 2019). Aeon Labs Z-Stick.

  • if you’re running a Pi, what power source are you using? We need the values here like Volt and Ampere from the power source.

The official Rpi power supply that came with the starter pack of the Rpi 3B+: 5.1V - 2.5A.

  • are there any outside things that could cause this? Maybe the cat laying on your Pi every Monday? :wink:

The Rpi is indoors in a professional (IP-rated) enclosure and there are at least 2 closed doors between it and the cats at that time of the week. :wink:

  • Can someone provide a somewhat exact timeline of the issue? From what I can see here, it does not always seem to be the exact same time, but the more precisely one of you can tell when this happens, the easier it is to find something.

In my case it does not always stop at the exact same time. I do not know the exact time, as I’m sleeping at that time of day, but the latest recordings in the history are between 1am and 3am, mostly around 1:30am. I’m in the Netherlands, which is GMT+1 in winter.

  • The router could be an issue, so please provide the model. I personally don’t think it is, so this is not necessary information.

Ziggo connect-box, which is being provided by the ISP and is probably a Compal CH7465LG. I know it’s not the best one.

eight replies and not one question is answered

I have work to do during daytime and I also prefer to write a complete and correct answer instead of a fast reply. Most of the above information was already in this thread or the linked thread. There was less than 90 minutes between the post with the questions and this post. I’ll just assume you were being sarcastic :wink:

I’m willing to give it a try, but I’m not overly enthusiastic about the proposed band-aid. As others already pointed out, it will probably not work, as it seems that automations do not run after the crash. In my view it’s also not really a minor issue. Having basic functionality running stably is a must for a home automation system IMHO. I believe we are doing Home Assistant and its users (including myself) a favor by trying to actually solve this.

I really have strong doubts that the crashes are in some way being caused by the HW. The same HW has been running for ~18 months without any issues. The problem is that I do not know exactly when the issues started, but I guess it was somewhere in early/mid January. The first time I was surprised, but simply restarted without paying too much attention. These things can happen, right? The second time I was surprised again and wondered what could have been the cause. I guess after the third or fourth time I realized it was always on Mondays. Then a few weeks went by trying to figure out the cause myself, and about 3 weeks ago I started seeking help in this community. So, I do not really know with which FW version the issues started.

Again, thanks for stepping in!

I’d like to be clear here: I’m just a moderator. I’m not taking point on any fixes. All this information should be written up into an issue on GitHub.

EDIT: For further clarification, a link to this post is not sufficient. These details need to go inside the issue. Hint hint @nelsonamen @Harmpert

Understood @petro. I also understand (and feel) you do not owe me any help whatsoever. So, any help is appreciated. You do however have me puzzled. I proposed to open an issue, but @paddy0174 suggested not to do so and to first provide some additional info. Once that info is provided, you reply that the information should go into the issue :man_shrugging:
What would the best process be then?

The best process is to have the information inside an issue written up on GitHub. Currently the issue has zero information in it, just a link to this thread. That’s a surefire way of getting ignored.

Thanks @petro!
I have added my entries to the issue on GitHub. I saw @Plevuus did the same.
Let’s see if we can get some of the others to add their configuration too and hopefully someone will rise to the challenge of trying to find the issue.

Do you have another machine you could use as a test instance?
Restore your backup to that and see if it still crashes

Yes I was, sorry if I offended you, wasn’t my intention. :wink: And it was totally not aimed in your direction. :wink: We are all here to help and as with most things in life, together people are stronger and better in finding solutions. :slight_smile:

The reason why I proposed to wait with an issue on GitHub is that if the report isn’t in a form developers can work with, it will get closed sooner rather than later. So that’s where a community board comes in. Someone needs to collect and sort the different things people notice, to give the developing team something to work on. I have to be honest, I’m the same with issues on my projects. If it isn’t informative, I won’t follow it. To make a long story short, here in the forum we collect and clean up, then someone can open (or in this case update) the issue with mostly only relevant info. Only if we can’t get anywhere do we try to get a developer on board to give some insight. :slight_smile:

But now, back to topic. :wink:

You’d be surprised how often simple tasks can lead to a system cutout. That’s why I’m trying to find similarities between all these setups. Let me give you an example (not related to this issue): on one of my Pis I had constant cutouts and restarts at very different times throughout the day. It turned out to be a badly configured Bluetooth connection that halted nearly the whole system and in the end needed so much power that my power source wasn’t able to deliver…

My point is, we don’t know what updates have been implemented inside HassOS that we aren’t aware of.

What makes me think about an OS-level issue is the abrupt ending in the log. If a system is being halted, it normally doesn’t stop in the middle of a sentence. It writes a line to the logs or it doesn’t, but not something in between. That leads me to think something with the power could be off. I suspect some routine that is running at OS level, like a cronjob or something like that. The time frame supports that a little, as the start of a new week would be a natural time to run something. Maybe a check starts at 00:00h on Monday morning and, depending on how badly it hangs the system, leads after some time to a spike in power draw that can’t be handled. Combine this with an updated driver for such a routine (updated somewhere in December last year), and you could get exactly what you’re experiencing, even with no change in hardware components.

What goes against this theory is that the HA automations seem to run for much longer than the rest. That would point to Supervisor hanging itself… :wink:
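To check whether the freezes really line up with a weekly schedule, it could help to put everyone’s last log entry on a common clock first. The following is only a rough sketch: the timestamps are hypothetical placeholders (not taken from anyone’s logs), `tally_utc` is a made-up helper name, and the UTC offsets are entered by hand as fixed values.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def tally_utc(last_entries):
    """Count last-log-entry times by (weekday, hour) after converting to UTC.

    `last_entries` is a list of (naive local datetime, UTC offset in hours).
    """
    counts = Counter()
    for local_time, offset_hours in last_entries:
        tz = timezone(timedelta(hours=offset_hours))
        utc_time = local_time.replace(tzinfo=tz).astimezone(timezone.utc)
        counts[(DAYS[utc_time.weekday()], utc_time.hour)] += 1
    return counts


# Hypothetical placeholder timestamps, made up for illustration only.
entries = [
    (datetime(2021, 3, 15, 1, 30), 1),  # 01:30 local time at GMT+1
    (datetime(2021, 3, 15, 1, 5), 0),   # 01:05 local time at GMT
    (datetime(2021, 3, 22, 2, 28), 1),  # 02:28 local time at GMT+1
]
counts = tally_utc(entries)
for (day, hour), n in sorted(counts.items()):
    print(f"{day} {hour:02d}:00-{hour:02d}:59 UTC: {n}")
```

If most entries end up in the same UTC hour regardless of the local timezone, that would hint at something scheduled in UTC; if they line up in local time instead, a locally scheduled job or automation would be the more likely suspect.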

So what do we have right now (Harmpert, Wilber, Plevuus):

  • 2x Pi 3B, 1x Pi 3B+,
  • 3x microSD (1x 16GB, 1x 32GB)
  • 1x 2.5A, 1x 2.0A, Wilber n/a
  • 2x HassOS, 1x n/a
  • 3x GMT +1

Do any of you know when the issue started? The more precise, the better, but a month will still do for a start. :slight_smile:
Can some of you provide some kind of log, let’s say from around one and a half hours before the crash? Same here, whatever you’ve got is good.
What I really would want to see is the supervisor log for that time, but I’ll have to check how to get that from an SD card (afterwards).
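For anyone who wants to cut their log down before posting, here is a minimal sketch that keeps only the lines from the 90 minutes before the last entry you saw. It assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp (newer or differently configured installs may format this differently); `lines_before` is a hypothetical helper name and the sample lines are invented.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed prefix of every log entry


def lines_before(lines, last_seen, window=timedelta(minutes=90)):
    """Return log lines stamped within `window` before `last_seen` (inclusive)."""
    start = last_seen - window
    selected = []
    keep = False  # was the most recent timestamped line inside the window?
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], TIMESTAMP_FORMAT)
        except ValueError:
            # Continuation line (e.g. part of a traceback): follow the
            # decision made for the entry it belongs to.
            if keep:
                selected.append(line)
            continue
        keep = start <= stamp <= last_seen
        if keep:
            selected.append(line)
    return selected


# Hypothetical sample lines, made up for illustration only.
sample = [
    "2021-03-14 22:00:00 INFO (MainThread) [example] too early",
    "2021-03-15 00:10:00 INFO (MainThread) [example] inside window",
    "  continuation of the previous entry",
    "2021-03-15 01:28:00 WARNING (MainThread) [example] last entry",
]
recent = lines_before(sample, datetime(2021, 3, 15, 1, 30))
print("\n".join(recent))
```

To use it on a real file, read the lines with `open(path).read().splitlines()` and pass the time of the last entry you can find as `last_seen`.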

Just to clarify for me: none of you have any problems during the week, it is always the one freeze on Sunday night / Monday morning?

PS: My suggestion would be to leave the issue as it is; one of you can always start a new one when we have gathered relevant information. :slight_smile: But that’s not my choice to make, do as you please. :slight_smile:

Could it be that you have a family member that disconnects HA and any surveillance equipment on purpose to get some privacy on Mondays?

@paddy0174 When I think back, my problems started around November/December. I installed the Philips Hue bridge and motion detector around that time, but that might be unrelated.
I do have 2 iPhones, 1 Android phone and 2 Android tablets connected through the mobile app.

@Messier1994 :rofl: Nope, I have a motion detector right next to my Pi and that one does not show any movement around that time…

To update on paddy0174’s correlation list, my setup:

Pi3B+
MicroSD (64GB A2)
2A PSU
HassOS (always on the latest version)
And I’m on GMT (not GMT+1).

On the PSU front, I’ve previously had a “dodgy” PSU that would occasionally cause under voltage warnings - and this did not cause any crashes/failures.
For me, I think the issue started around January - but it took me a while to notice it was every Monday at the same time

No worries! I’m not easily offended. (please do not take this as a challenge :wink: )

Well, I guess in the case of your example one could argue whether it was caused by a HW or SW issue. Let’s not go there, I get your point. I have been using the Rpi power status sensor for about a week and it has been OK all the time, also just before the crash. I’ll try to add some of the System Monitor sensors. I’ll add processor_temperature, memory_use_percent, disk_use_percent, processor_use. Any other suggestions?

I really cannot recall when the issue started occurring. As mentioned before, I guess somewhere in January. In the starter post of @DomJo of 11 January Dominik mentions since mid/end December. It could be that I was not really running the latest Home Assistant version for a few weeks. I do manual updates and I’m never far behind, but not always completely up-to-date.

I have no issues throughout the week. Only on Monday mornings. (Except for 2 weeks ago, but that was solved by re-downloading and re-installing Home Assistant on a new SD card and using a snapshot to get going again.)

Uh, a challenge. :smiley: No, won’t do that, I promise!

At the moment I’m just assuming it started somewhere after the beginning of December, so I started to go through all PRs for supervisor from that point on. This is just an assumption, but I think supervisor is the best place to start. Maybe something turns up.

Unfortunately one of my other ideas, regarding the timezone, proved to be wrong, so this is out for the moment. :frowning:

Any ideas are taken, if you have something in mind, get it out. :smiley: As I said, for the moment my best guess is supervisor, so I’m trying my luck there. But tbh, every idea that sounds better than searching supervisor PRs is very welcome. :smiley: :smiley: :smiley: