Docker crashes on startup and reboots the system in an endless loop

An issue has recently appeared on our HassOS-based systems. We assume it is triggered by sudden power loss.

It plays out like this: after the system loses power and is started again, Docker fails to start during boot. The system then automatically reboots itself, and an endless cycle begins: startup -> Docker fails -> reboot -> repeat.
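For context on the reboot half of the loop: systemd restarts a failed service according to its Restart= setting, and once the start rate limit is hit, a failure handler can reboot the machine. Below is a purely hypothetical sketch of the kind of unit wiring that produces such a loop; we have not inspected the actual HassOS units, which may differ:

```ini
# Hypothetical drop-in illustrating the mechanism, NOT the actual HassOS units.
[Unit]
StartLimitIntervalSec=60
StartLimitBurst=3
# When the service finally gives up, a failure handler can reboot the machine:
OnFailure=reboot.target

[Service]
Restart=on-failure
RestartSec=2
```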

Of the messages shown during startup, the following seem to be the most relevant:

         Starting HassOS Configuration Manager...
         Starting Network Manager Wait Online...
[  OK  ] Started Network Manager Script Dispatcher Service.
[   10.349615] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   10.453699] smsc95xx 1-1.1:1.0 eth0: hardware isn't capable of remote wakeup
[   10.461157] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   10.485051] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[   10.573825] smsc95xx 1-1.5.1:1.0 eth1: hardware isn't capable of remote wakeup
[   10.581438] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[  OK  ] Started Network Manager Wait Online.
[  OK  ] Reached target Network is Online.
         Starting Docker Application Container Engine...
         Starting Avahi mDNS/DNS-SD Stack...
[  OK  ] Started Avahi mDNS/DNS-SD Stack.
[FAILED] Failed to start Docker Application Container Engine.
See 'systemctl status docker.service' for details.
[DEPEND] Dependency failed for Getty on tty1.
[DEPEND] Dependency failed for Docker SWARM Watchdog.
[DEPEND] Dependency failed for Dropbear SSH daemon.
[DEPEND] Dependency failed for COMPANY Bootstrap.
[DEPEND] Dependency failed for Serial Getty on ttyAMA0.
[  OK  ] Started HassOS Configuration Manager.
[  OK  ] Reached target Login Prompts.
[  OK  ] Reached target Multi-User System.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Stopped Docker Application Container Engine.
[  OK  ] Closed Docker Socket for the API.
         Stopping Docker Socket for the API.
         Starting Docker Socket for the API.
[  OK  ] Listening on Docker Socket for the API.
         Starting Docker Application Container Engine...
[  OK  ] Started Update UTMP about System Runlevel Changes.
[FAILED] Failed to start Docker Application Container Engine.
See 'systemctl status docker.service' for details.

We mounted the system's drive on a laptop and looked at the systemd journal messages from Docker, of which there are quite a lot. For the most part, the content looks like a dump of the active goroutines:

Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/host/lib/go/src/internal/poll/fd_posix.go:56 +0x70
Oct 10 08:13:59 company dockerd[278]: os.(*File).Sync(0x3d222b8, 0x16d6564, 0x4f2dc80)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/host/lib/go/src/os/file_posix.go:128 +0x3c
Oct 10 08:13:59 company dockerd[278]: github.com/docker/docker/pkg/ioutils.(*atomicFileWriter).Close(0x47c6940, 0x0, 0x0)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/pkg/ioutils/fswriters.go:68 +0x5c
Oct 10 08:13:59 company dockerd[278]: github.com/docker/docker/container.(*Container).WriteHostConfig(0x3cf9e00, 0x44cc000, 0x0, 0x0)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/container/container.go:255 +0x21c
Oct 10 08:13:59 company dockerd[278]: github.com/docker/docker/container.(*Container).toDisk(0x3cf9e00, 0x0, 0x0, 0x0)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/container/container.go:181 +0x1e8
Oct 10 08:13:59 company dockerd[278]: github.com/docker/docker/container.(*Container).CheckpointTo(0x3cf9e00, 0x18e79a8, 0x3c0ee98, 0x0, 0x0)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/container/container.go:192 +0x1c
Oct 10 08:13:59 company dockerd[278]: github.com/docker/docker/daemon.(*Daemon).checkpointAndSave(0x3cf4d00, 0x3cf9e00, 0x0, 0x0)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/daemon/daemon.go:1391 +0x84
Oct 10 08:13:59 company dockerd[278]: github.com/docker/docker/daemon.(*Daemon).restore.func1(0x4901a00, 0x3cf4d00, 0x4901a10, 0x4576c80, 0x4576c60, 0x4576c40, 0x3cf9e00)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/daemon/daemon.go:273 +0x6c
Oct 10 08:13:59 company dockerd[278]: created by github.com/docker/docker/daemon.(*Daemon).restore
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/daemon/daemon.go:270 +0x978
Oct 10 08:13:59 company dockerd[278]: goroutine 163 [runnable]:
Oct 10 08:13:59 company dockerd[278]: syscall.Syscall(0x76, 0xc6, 0x0, 0x0, 0x0, 0x0, 0x0)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/host/lib/go/src/syscall/asm_linux_arm.s:14 +0x8
Oct 10 08:13:59 company dockerd[278]: syscall.Fsync(0xc6, 0x16d8d5c, 0x40f7580)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/host/lib/go/src/syscall/zsyscall_linux_arm.go:449 +0x30
Oct 10 08:13:59 company dockerd[278]: internal/poll.(*FD).Fsync(0x40f7580, 0x0, 0x0)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/host/lib/go/src/internal/poll/fd_posix.go:56 +0x70
Oct 10 08:13:59 company dockerd[278]: os.(*File).Sync(0x41b39b8, 0x16d6564, 0x4629c80)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/host/lib/go/src/os/file_posix.go:128 +0x3c
Oct 10 08:13:59 company dockerd[278]: github.com/docker/docker/pkg/ioutils.(*atomicFileWriter).Close(0x4793080, 0x0, 0x0)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/pkg/ioutils/fswriters.go:68 +0x5c
Oct 10 08:13:59 company dockerd[278]: github.com/docker/docker/container.(*Container).WriteHostConfig(0x4359040, 0x4a25900, 0x0, 0x0)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/container/container.go:255 +0x21c
Oct 10 08:13:59 company dockerd[278]: github.com/docker/docker/container.(*Container).toDisk(0x4359040, 0x0, 0x0, 0x0)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/container/container.go:181 +0x1e8
Oct 10 08:13:59 company systemd[1]: docker.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 10 08:13:59 company dockerd[278]: github.com/docker/docker/container.(*Container).CheckpointTo(0x4359040, 0x18e79a8, 0x3c0ee98, 0x0, 0x0)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/container/container.go:192 +0x1c
Oct 10 08:13:59 company dockerd[278]: github.com/docker/docker/daemon.(*Daemon).checkpointAndSave(0x3cf4d00, 0x4359040, 0x0, 0x0)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/daemon/daemon.go:1391 +0x84
Oct 10 08:13:59 company dockerd[278]: github.com/docker/docker/daemon.(*Daemon).restore.func1(0x4901a00, 0x3cf4d00, 0x4901a10, 0x4576c80, 0x4576c60, 0x4576c40, 0x4359040)
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/daemon/daemon.go:273 +0x6c
Oct 10 08:13:59 company dockerd[278]: created by github.com/docker/docker/daemon.(*Daemon).restore
Oct 10 08:13:59 company dockerd[278]:         /build/buildroot/output/build/docker-engine-v18.09.0/_gopath/src/github.com/docker/docker/daemon/daemon.go:270 +0x978

The above is only a small portion of the actual output, which contains many more goroutine stacks.
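All of these goroutines are blocked at the same point: fsyncing a container's hostconfig.json inside Docker's atomicFileWriter. The pattern that code implements — write to a temporary file, fsync it, then rename it over the destination — looks roughly like this Python sketch (Docker's actual implementation is in Go; this is only an illustration of the technique):

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to path so readers never see a partial file:
    write a temp file in the same directory, fsync it, then
    rename it over the destination (rename is atomic on POSIX)."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # the step the goroutines above are stuck in
        os.replace(tmp, path)      # atomically swap in the new file
    except BaseException:
        os.unlink(tmp)
        raise

atomic_write("hostconfig.json", b'{"NetworkMode": "default"}')
```

The rename step guarantees that readers see either the complete old file or the complete new file, never a partial one.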

After that, there are a few messages like the following:

Sep 06 12:10:02 company dockerd[476]: time="2019-09-06T12:10:02.712578011Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.aufs" error="modprobe aufs failed: "modprobe: FATAL: Module aufs not found in directory /lib/modules/4.14.66-v7\n": exit status 1"
Sep 06 12:10:02 company dockerd[476]: time="2019-09-06T12:10:02.712728688Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.native"..." type=io.containerd.snapshotter.v1
Sep 06 12:10:02 company dockerd[476]: time="2019-09-06T12:10:02.712879261Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.overlayfs"..." type=io.containerd.snapshotter.v1
Sep 06 12:10:02 company dockerd[476]: time="2019-09-06T12:10:02.713234365Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.zfs"..." type=io.containerd.snapshotter.v1
Sep 06 12:10:02 company dockerd[476]: time="2019-09-06T12:10:02.714401709Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.zfs" error="path /mnt/data/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter"
Sep 06 12:10:02 company dockerd[476]: time="2019-09-06T12:10:02.714484470Z" level=info msg="loading plugin "io.containerd.metadata.v1.bolt"..." type=io.containerd.metadata.v1
Sep 06 12:10:02 company dockerd[476]: time="2019-09-06T12:10:02.714582438Z" level=warning msg="could not use snapshotter aufs in metadata plugin" error="modprobe aufs failed: "modprobe: FATAL: Module aufs not found in directory /lib/modules/4.14.66-v7\n": exit status 1"
Sep 06 12:10:02 company dockerd[476]: time="2019-09-06T12:10:02.714642178Z" level=warning msg="could not use snapshotter zfs in metadata plugin" error="path /mnt/data/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter"
Sep 06 12:10:02 company dockerd[476]: time="2019-09-06T12:10:02.715197386Z" level=info msg="loading plugin "io.containerd.differ.v1.walking"..." type=io.containerd.differ.v1
Sep 06 12:10:02 company dockerd[476]: time="2019-09-06T12:10:02.715314105Z" level=info msg="loading plugin "io.containerd.gc.v1.scheduler"..." type=io.containerd.gc.v1
Sep 06 12:10:02 company dockerd[476]: time="2019-09-06T12:10:02.715483532Z" level=info msg="loading plugin "io.containerd.service.v1.containers-service"..." type=io.containerd.service.v1
Sep 06 12:10:02 company dockerd[295]: pc      0x76da4198
Sep 06 12:10:02 company dockerd[295]: cpsr    0x10
Sep 06 12:10:02 company dockerd[295]: fault   0x0

In these messages, the ones about aufs failing seem important (also because of the exit status 1), but they are marked as mere warnings, so I'm not sure whether they actually contribute to the issue.

After the stack traces, this shows up:

Oct 10 07:57:55 company dockerd[655]: time="2019-10-10T07:57:55.822297680Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Oct 10 07:57:55 company dockerd[655]: time="2019-10-10T07:57:55.825508096Z" level=info msg="Loading containers: start."
Oct 10 07:57:56 company kernel: RTL871X: RTW_ADAPTIVITY_EN_
Oct 10 07:57:56 company kernel: AUTO, chplan:0x20, Regulation:3,3
Oct 10 07:57:56 company kernel: RTL871X: RTW_ADAPTIVITY_MODE_
Oct 10 07:57:56 company kernel: NORMAL
Oct 10 07:57:58 company kernel: RTL871X: nolinked power save leave
Oct 10 07:58:00 company dockerd[655]: runtime/cgo: pthread_create failed: Resource temporarily unavailable
Oct 10 07:58:00 company dockerd[655]: SIGABRT: abort
Oct 10 07:58:00 company dockerd[655]: PC=0x76da1198 m=2 sigcode=4294967290
Oct 10 07:58:00 company dockerd[655]: goroutine 0 [idle]:
Oct 10 07:58:00 company dockerd[655]: runtime: unknown pc 0x76da1198
...
Oct 10 07:58:01 company systemd[1]: docker.service: Start request repeated too quickly.
Oct 10 07:58:01 company systemd[1]: docker.service: Failed with result 'exit-code'.
Oct 10 07:58:01 company systemd[1]: Failed to start Docker Application Container Engine.
Oct 10 07:58:01 company systemd[1]: docker.socket: Failed with result 'service-start-limit-hit'.
Oct 10 07:58:27 company systemd[1]: Starting HassOS rauc good...

Noteworthy is the "Resource temporarily unavailable" message, which reportedly occurs when Docker cannot spawn any more processes or threads.
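"pthread_create failed: Resource temporarily unavailable" is EAGAIN: the daemon could not create another thread. On a systemd system this commonly means the unit's task limit (TasksMax, which backs the cgroup pids controller) or RLIMIT_NPROC was hit, or memory was exhausted. If the task limit turns out to be the cause, it can be raised with a drop-in along these lines (path and value are illustrative, not taken from our system):

```ini
# /etc/systemd/system/docker.service.d/tasksmax.conf  (illustrative path)
[Service]
TasksMax=4096
# or: TasksMax=infinity to disable the per-unit limit entirely
```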

I’ve searched for possible solutions, but nobody has addressed the root cause; they have only applied workarounds such as reflashing the SD card. Reflashing is not an option for us: for one thing, we don't use an SD card, and more importantly the system runs in a production environment where user data must be preserved at all times.

Similar topics I’ve found on the HomeAssistant Forum:

Any help debugging this issue would be greatly appreciated.

The way I see it, there are two problems here:

  • Docker crashing (maybe because of the number of processes it spawned? But our limit is set to 2021, so I'm not sure how it reached that limit so quickly)
  • the Docker crash causing the reboot
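To check whether dockerd really approaches the limit of 2021 before crashing, one can count its kernel tasks via /proc on a still-running system. A minimal sketch (Linux only; substitute dockerd's PID for the example PID):

```python
import os

def task_count(pid: int) -> int:
    """Number of kernel tasks (threads) of a process, read from /proc/<pid>/task."""
    return len(os.listdir(f"/proc/{pid}/task"))

# Example: inspect the current process; replace os.getpid() with dockerd's PID.
limit = 2021  # the task limit mentioned above
n = task_count(os.getpid())
print(f"{n} tasks (limit {limit})")
```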

Some details about the system:

  • we only use HassOS, not HomeAssistant
  • our storage is not an SD Card, it’s an eMMC
  • docker version (taken from an unaffected system):
Client:
 Version:           v18.09.0
 API version:       1.39
 Go version:        go1.11.4
 Git commit:        v18.09.0
 Built:             unknown-buildtime
 OS/Arch:           linux/arm
 Experimental:      false

Server:
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.11.4
  Git commit:       
  Built:            Mon Nov 18 17:22:36 UTC 2019
  OS/Arch:          linux/arm
  Experimental:     false

I found the issue :slight_smile:
…first of all, Docker can sometimes crash when it has too many (dead) containers to load.
This might not happen to everyone; it looks like a Docker Swarm related issue, though it can't be said that that's the only cause.

Since the Docker service is the one everything in systemd depends on, systemd triggers a restart when it fails.

Also, if that's not the issue: the HassOS rauc checker looks at whether Docker crashed. One minute after systemd starts, if Docker has failed, rauc marks the partition as bad and reboots.
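For anyone hitting the same loop who is willing to sacrifice container state (but not images or volumes): with the data partition mounted on another machine, removing the per-container state directories gives dockerd nothing to restore on the next boot, after which the containers can be recreated. This is only a sketch, under the assumption that Docker's data root is /mnt/data/docker, as suggested by the log messages above; verify the path on your own system before deleting anything:

```python
import os
import shutil

# ASSUMPTION: Docker's data root, taken from the log messages above.
# Verify this path on your own system before running anything destructive.
DOCKER_ROOT = "/mnt/data/docker"

def clear_container_state(docker_root):
    """Remove every per-container state directory under <docker_root>/containers.
    Returns the list of removed container IDs. Images and volumes are untouched."""
    containers = os.path.join(docker_root, "containers")
    if not os.path.isdir(containers):
        return []
    removed = []
    for entry in os.listdir(containers):
        path = os.path.join(containers, entry)
        if os.path.isdir(path):
            shutil.rmtree(path)   # discards that container's config and state
            removed.append(entry)
    return removed

if __name__ == "__main__":
    print(clear_container_state(DOCKER_ROOT))
```

Each subdirectory of containers/ is named after a container ID and holds that container's state files (including the hostconfig.json the stack traces were writing); removing it discards the container while leaving images intact.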

Can you please also share how to fix this?

I am running off an SD card on a Raspberry Pi 4 (HassOS).
I don't have any backups.
I don't mind reinstalling if I can figure out how to save my current configurations first.