Debugging a startup crash

achatham · October 24, 2022, 5:21pm

I haven’t been able to get any release newer than 2022.6.7 to start up. Every newer release crashes with the following message and no stack trace:

terminate called after throwing an instance of 'std::runtime_error'
  what():  random_device: rdrand failed

I’ve tried changing the logger to debug, but I can’t tell which component is causing the problem, as it’s all initialized in parallel.

I have an old Docker setup that uses tribut/homeassistant-docker-venv (a GitHub project for running Home Assistant as non-root with the official Docker image), so I wouldn’t be surprised if I’ve messed myself up that way. But I still have two general questions:

Can I force serial initialization to narrow down which code is crashing?
Can I disable integrations from configuration.yaml? The service doesn’t stay up long enough for me to disable them in the UI.

achatham · October 24, 2022, 6:24pm

After further debugging, I found that this is caused by default_config:. If I remove it, nothing crashes. Oddly, when I replace it with every integration that default_config includes, it still doesn’t crash.

For example:

application_credentials:
automation:
...
zeroconf:
zone:

I also tried attaching with debugpy but it doesn’t catch anything, presumably because this is a crash in the C code.

CentralCommand · October 24, 2022, 6:32pm

That is quite bizarre. So you added all the integrations in default_config? You didn’t remove any of them, you just listed them all in place of default_config? That definitely seems like a bug in that case.

Did you just add debugpy or did you add this:

debugpy:
  start: true

Setting start: true causes HA to wait for a debugger to be attached at the very beginning of startup before proceeding with the rest of startup. If you are trying to debug a startup issue, I would do that. Otherwise you’re just in a race to see whether you can attach the debugger before it hits whatever is breaking, and computers typically win over humans in races like that.

achatham · October 24, 2022, 10:43pm

I commented out default_config and added all the other configs. Some were duplicated later in the config with my own real definitions, like scripts:.

I added start: true and was able to trace everything in the debugger. The problem is that it just hard-exited with no indication in the debugger, since it wasn’t a Python-level crash.

achatham · October 24, 2022, 11:03pm

Made more progress. I’m sure this is going to be due to my weird container setup in the end, but documenting this in case anyone else runs into similar problems.

I attached with debugpy, paused the main thread, and then ran import faulthandler; faulthandler.enable() so I get a stack trace when the crash happens:

  what():  random_device: rdrand failed
Fatal Python error: Aborted
Thread 0x00007f6eb0e7fb30 (most recent call first):
  File "/usr/local/lib/python3.10/ssl.py", line 1130 in read
  File "/usr/local/lib/python3.10/ssl.py", line 1274 in recv_into
  File "/usr/local/lib/python3.10/socket.py", line 705 in readinto
  File "/usr/local/lib/python3.10/http/client.py", line 465 in read
  File "/usr/local/lib/python3.10/site-packages/urllib3/response.py", line 532 in _fp_read
  File "/usr/local/lib/python3.10/site-packages/urllib3/response.py", line 566 in read
  File "/usr/local/lib/python3.10/site-packages/urllib3/response.py", line 627 in stream
  File "/usr/local/lib/python3.10/site-packages/requests/models.py", line 816 in generate
  File "/usr/local/lib/python3.10/site-packages/sseclient/__init__.py", line 48 in _read
  File "/usr/local/lib/python3.10/site-packages/sseclient/__init__.py", line 58 in events
  File "/usr/local/lib/python3.10/site-packages/nest/nest.py", line 1792 in _start_event_loop
  File "/usr/local/lib/python3.10/threading.py", line 953 in run
  File "/usr/local/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/local/lib/python3.10/threading.py", line 973 in _bootstrap
  File "/usr/local/lib/python3.10/site-packages/debugpy/_vendored/pydevd/_pydev_bundle/pydev_monkey.py", line 1053 in __call__
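
For anyone who wants to repeat the faulthandler step, here is a minimal sketch of what I ran, written as a small helper. The helper name and the log path are made up for illustration. The only real API used is the standard library's faulthandler module, which dumps Python tracebacks on fatal signals such as SIGABRT (the signal raised when a C++ exception goes uncaught).

```python
import faulthandler
import os
import tempfile


def enable_crash_traces(log_path):
    """Enable faulthandler and send fatal-signal tracebacks to log_path.

    The returned file object must stay open for as long as faulthandler
    is enabled, because faulthandler writes to its file descriptor.
    """
    log_file = open(log_path, "a")
    # all_threads=True dumps every thread, not only the one that crashed.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical log location, chosen only for this example.
crash_log_path = os.path.join(tempfile.gettempdir(), "ha_faulthandler.log")
crash_log = enable_crash_traces(crash_log_path)
```

If attaching a debugger is not practical, setting the PYTHONFAULTHANDLER environment variable (or starting Python with -X faulthandler) should enable the same handler from the very start of the process, with output going to stderr.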

achatham · October 25, 2022, 11:15pm

I’m good now. It seems there was an rdrand bug in Ryzen chips, and I needed to apply a microcode update. Bizarre journey.
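
In case it helps anyone check their own machine, here is a rough sketch for reading the CPU model, microcode revision, and rdrand flag. It assumes a Linux x86 host where /proc/cpuinfo has "model name", "microcode", and "flags" fields. The function name and the sample text are made up for illustration. Note that the flag only says the CPU advertises rdrand, not that the instruction works correctly.

```python
def summarize_cpuinfo(text):
    """Pull the model name, microcode revision, and rdrand flag out of
    /proc/cpuinfo-style text. Only the first CPU's values are kept."""
    info = {"model_name": None, "microcode": None, "has_rdrand": False}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line between CPU entries
        key, value = key.strip(), value.strip()
        if key == "model name" and info["model_name"] is None:
            info["model_name"] = value
        elif key == "microcode" and info["microcode"] is None:
            info["microcode"] = value
        elif key == "flags" and "rdrand" in value.split():
            info["has_rdrand"] = True
    return info


# Made-up sample input, not taken from a real machine.
SAMPLE_CPUINFO = """\
processor   : 0
model name  : Example CPU (made-up sample)
microcode   : 0x1234
flags       : fpu sse2 rdrand rdseed
"""

summary = summarize_cpuinfo(SAMPLE_CPUINFO)
```

On a real host, the input would come from reading /proc/cpuinfo, and comparing the microcode value before and after the update is one way to confirm the update was applied.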