New issue: Something is overwhelming my system

In the past week or so, my system has come to an unusable standstill once, sometimes twice, a day, forcing me to reboot it the hard way…power off/on. I’m at a loss as to what’s wrong or where to look. The logs seem to start fresh at power up, or am I not looking in the right place?
I always stay current with updates so it’s impossible to tell whether HA, an add-on or an integration is at fault.
All feedback is welcome to help me narrow this down…

I had the same thing yesterday. Currently on 2023.3.5. HA completely crashed and had to be power cycled. The log just shows it failing to connect to other smart devices and then… nothing. It also happened to me a few times last month (back on 2023.2.x).

Interesting!!! I think it started with 2023.3.x. Everything just slowly stops working, like not enough CPU. I think I’ve still been getting mobile notifications…weird.

It’s been pretty much daily when I wake up…system not working, usually 5am to 8am. The odd time it’s much later in the day, like late afternoon or early evening.
Today was late afternoon.

I’m scratching my head trying to figure out where to look…

I think I might be done with doing core updates when they come out…
Once burned, twice shy…

Yep, I don’t know either unfortunately. Like I say, I had a spate of these crashes last month but then they stopped. Yesterday was my first one since updating to 2023.3.x

I’m running HA on an i7 NUC with heaps of RAM so it isn’t running out of resources…

Same here. I’m on an HA Blue (ODROID N2+) with heaps of resources…

I had an issue with backups consuming all disk space a few months ago but resolved that. Only 28% used now.

Just updated to 2023.3.6.

My Zigbee seems to be the first to die but http background stuff randomly packs it in. By the time I get to it, SSH isn’t even working.

Hmmm, I’ve had similar issues. At first they were related to the daily Samba backup to my NAS. But last week I had one where the logs showed an inability to connect to one SONOS box and boom…the entire system froze…it took a couple of minutes of waiting before I could connect again. A day later the system had to be cold booted.

Might be something like a database purge, depending on the size of your database, i.e. the home-assistant_v2.db file. In this case rebooting would only delay this from happening again.
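
One way to test the purge theory is to switch off the automatic nightly purge and run it at a time you choose, then see whether the freeze moves with it. A minimal sketch, assuming the built-in recorder integration; the alias and the 14:00 trigger time are arbitrary picks for illustration, and `keep_days: 10` is only an example retention value.

```yaml
# configuration.yaml (merge into any existing recorder: section)
recorder:
  auto_purge: false        # stop the automatic nightly purge

# Merge into your existing automations rather than adding a second automation: key
automation:
  - alias: "Recorder purge at a known time"   # hypothetical name
    trigger:
      - platform: time
        at: "14:00:00"                         # arbitrary; pick an hour when you can watch
    action:
      - service: recorder.purge
        data:
          keep_days: 10                        # example value
          repack: false
```

If the freezes follow the new time, the purge is a strong suspect; if they don't, it can probably be ruled out.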

I am using a well trimmed mariadb, so that’s not it, plus the timing does not coincide.

recorder:
  db_url: !secret mariadb_url
  db_retry_wait: 5
  purge_keep_days: 10
  commit_interval: 5
  include:
    entities:
      - sensor.circuit_11_power
    domains:
      - automation
      - binary_sensor
      - climate
      - cover
      - device_tracker
      - input_boolean
      - input_datetime
      - input_number
      - input_select
      - input_text
      - light
      - lock
      - media_player
      - person
      - sensor
      - switch
  exclude:
    event_types:
      - automation_triggered
      - call_service
      - component_loaded
      - feedreader
      - homeassistant_start
      - homeassistant_stop      
      - logbook_entry
      - platform_discovered
      - script_started
      - service_executed
      - service_registered
      - service_removed
      - system_log_event
      - timer_out_of_sync
    entities:
      - sensor.insteon_groups
      - sensor.home_assistant_v2_db
      - sensor.last_active
      - sensor.last_boot # Comes from 'systemmonitor' sensor platform
      - sensor.mariadb_database_size
      - sensor.processor_use
    entity_globs:
    # Emporiavue
      - sensor.circuit_*
      - sensor.car_charger*
      - sensor.dishwasher*
      - sensor.dryer*
      - sensor.fridge*
      - sensor.furnace*
      - sensor.microwave*
      - sensor.range*
      - sensor.emporia_*
    # Others
      - sensor.*battery*
      - sensor.*node_status*
      - sensor.*voltage*
      - sensor.*power*
      - sensor.*rx*
      - sensor.*tx*
      - sensor.browser_mod*
      - sensor.clock*
      - sensor.date*
      - sensor.glances*
      - sensor.hass_agent*
      - sensor.internet*
      - sensor.load_*
      - sensor.memory_*
      - sensor.processor_*
      - sensor.sun*
      - sensor.time*
      - sensor.*uptime*
      - sensor.weather*

I can agree with @dbrunt. I also run MariaDB and the issue doesn’t coincide with the purge.

My DB is 1545 MB and the purge doesn’t line up either. It’s quite random when it crashes and when it does, I lose SSH access too. The whole thing is completely unresponsive.

My issue is not db related…been there, done that, not new to this! (but I appreciate the feedback!)
Plus it simply does not correlate, and it happens at random times of day (and it’s been daily!)
2:27am here and I expect it to die within 24 hours again.
“Alexa, good morning” and …no response!

At least I seem to not be alone in this!

It’s always nice to not be alone :smiley:

Maybe the issue is related to how HA waits for connectivity with systems which might be asleep. Like my SONOS system, which isn’t used 24/7. Or in your case some Zigbee sensors not being active when the system tries to connect to them.

Just guessing :face_with_raised_eyebrow: …perhaps the developers already know of this and are working on a solution.

My observations:

  • It’s not a very new problem. I have been experiencing it for the last few months, although irregularly.
  • Just before the crash the CPU usage goes up to 100% for a while
  • For me it ALWAYS happens between 00:00 and 02:00 CET - that’s why I am convinced it’s some maintenance process (like DB cleaning or similar) causing this, even if indirectly
  • It SEEMS to have something to do with the number of entities / DB size. I maintain 3 independent installations on EXACTLY the same hardware and add-ons, and the largest is the most affected
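
To check the CPU observation above on your own install, the System Monitor sensors can track load and memory over time. A minimal sketch, assuming a release that still accepts the YAML `systemmonitor` platform (as the 2023.3.x versions discussed here do); the choice of resources is only a suggestion.

```yaml
sensor:
  - platform: systemmonitor
    resources:
      - type: processor_use        # CPU usage in %
      - type: memory_use_percent
      - type: load_5m
      - type: disk_use_percent
        arg: /                     # assumed mount point; adjust to your disk
```

These only help after a freeze if their history is recorded, so they must not be excluded from the recorder (the config posted earlier in this thread excludes `sensor.processor_*`, `sensor.memory_*` and `sensor.load_*`).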

I did have a database issue several months ago when it was near 9GB and the system would grind to a halt shortly after 5am PST, which I learned was my backup time. I changed the backup time and the time of the issue followed it. The backup would take the database offline to back it up and, due to its size, it stayed offline for a lengthy period. While it was offline, HA cached its changes, and given the volume of changes and the length of time, the cache limit was being exceeded and HA would halt. Once I figured out the culprit, I managed to prune my database down to 900 MB and severely cut back on the items being logged, and that problem stopped. However, this issue, while usually 5am to 9am, often happens at other random times of the day.

Your issue definitely sounds like the one I had. You need to have a hard look at your recorder settings.

The difference is that my DB has always been below 300MB - I have excluded unneeded sensors from the recorder from the start and keep only 5 days of history since I don’t need much.

I also do not have automatic backups configured. I do them manually once a week in “normal” hours :slight_smile:

I updated to HA 2023.3.6 last night and then HA crashed in the early hours of this morning. Once again there is nothing notable in the HA log file.
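
Since the log shows nothing notable before the crash, raising the log level for the suspected integrations might capture more. A minimal sketch using the `logger` integration; which components to list is a guess based on this thread (recorder, Sonos), not a diagnosis.

```yaml
logger:
  default: warning
  logs:
    homeassistant.components.recorder: debug
    homeassistant.components.sonos: debug   # assumption: only relevant if you use Sonos
```

Debug logging is verbose, so it is worth reverting once a crash has been captured.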