s6-svwait not reaping zombies? from Daniel Griscom on 2021-07-21 (skaware)

From: Daniel Griscom <griscom_at_suitable.com>
Date: Wed, 21 Jul 2021 15:19:26 -0400

Hello, all. I'm using s6 as the init process manager in a Docker
container, using s6-overlay. Everything's working fine, but when I send a
SIGINT to the container, the processes being managed exit, but they
become zombies and aren't reaped, forcing the system to time out (twice,
actually).

The container image is ubuntu:20.04 with s6-overlay amd64 version
2.2.0.3, which I believe has the latest s6. It all runs on an Ubuntu
18.04 system. It looks like s6-svscan sends SIGINT or SIGTERM to the
processes, and then uses s6-svwait to wait for the processes to exit,
but the zombie processes are never reaped.
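
To make "zombie" and "reaped" concrete, here is a minimal Python sketch,
assuming Linux with /proc mounted. It is only an illustration of the
kernel behaviour, not anything taken from s6; the names proc_state and
demo_zombie are hypothetical. A child that has exited stays in state Z
until its parent waits for it, and only then does its process table entry
disappear.

```python
import os


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The command name sits in parentheses and may contain spaces,
    # so take the first field after the last ')'.
    return data.rsplit(")", 1)[1].split()[0]


def demo_zombie():
    """Fork a child, let it exit, and report its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    before = proc_state(pid)  # exited but not waited for: a zombie
    os.waitpid(pid, 0)        # the parent reaps the child
    after = proc_state(pid)   # the entry is gone once reaped
    return before, after
```

The "Z" in the STAT column of "ps axl" is the same state that
proc_state reads here.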

I found the following reference, which suggests this might be a
kernel problem: https://github.com/just-containers/s6-overlay/issues/135
, although I'm not seeing the high zombie CPU usage referenced. I also
found https://wiki.gentoo.org/wiki/S6 , which suggested that sending a
SIGCHLD to s6-svscan would cause it to re-scan for zombies; that didn't work.
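
For context, the reaping that a PID 1 process is generally expected to
do when it receives SIGCHLD amounts to a non-blocking wait loop. The
Python sketch below shows that generic pattern only; it is not
s6-svscan's actual code, and reap_all is a hypothetical name.

```python
import os


def reap_all():
    """Reap every child that has already exited, without blocking.

    Returns the list of PIDs that were reaped.
    """
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

A process running as PID 1 would typically call something like this
each time SIGCHLD arrives, since orphaned processes are reparented to it
and nobody else can wait for them.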

Here are the processes once everything is started (viewed by "ps axl"
after running bash in a separate connection to the container):
> root_at_4fa66da81d02:/# ps axl
> F   UID   PID PPID PRI NI    VSZ   RSS WCHAN STAT TTY        TIME COMMAND
> 4     0     1     0 20   0    196     4 poll_s Ss+ pts/0      0:00 s6-svscan -t0 /var/run/s6/services
> 4     0    35     1 20   0    196     4 poll_s S+   pts/0      0:00 s6-supervise s6-fdholderd
> 4     0   228     1 20   0    196     4 poll_s S+   pts/0      0:00 s6-supervise thttpd
> 4     0   229     1 20   0    196     4 poll_s S+   pts/0      0:00 s6-supervise exrouter
> 4 65534   232   228 30 10 179052 165784 poll_s SNs ?          0:00 /opt/pdm/bin/thttpd -nip -nos -c **.html|**.sh|
> 4     0   233   229 30 10   6224 1568 poll_s SNs ?          0:00 /opt/pdm/bin/exrouter-cpp
> 4     0   247     0 20   0   5996 3756 do_wai Ss   pts/1      0:00 bash
> 4     0   255   247 20   0   7568 3024 -      R+   pts/1      0:00 ps axl
And, once I issue a ^C to the container, but before any timeout:
> root_at_4fa66da81d02:/# ps axl
> F   UID   PID PPID PRI NI    VSZ   RSS WCHAN STAT TTY        TIME COMMAND
> 4     0     1     0 20   0    176     4 do_wai Ss+ pts/0      0:00 foreground backtick -D 3000 -n S6_SERVICES
> 4 65534   232     1 30 10      0     0 -      ZNs ?          0:00 [thttpd] <defunct>
> 4     0   233     1 30 10      0     0 -      ZNs ?          0:00 [exrouter-cpp] <defunct>
> 4     0   247     0 20   0   5996 3860 do_wai Ss   pts/1      0:00 bash
> 0     0   271     1 20   0    176     4 do_wai S+   pts/0      0:00 foreground s6-svwait -D -t 10000 /var/run/
> 4     0   278   271 20   0    204     8 poll_s S+   pts/0      0:00 s6-svwait -D -t 10000 /var/run/s6/services/thtt
> 4     0   279   278 20   0    452     4 poll_s S+   pts/0      0:00 s6-ftrigrd
> 4     0   280   247 20   0   7568 2976 -      R+   pts/1      0:00 ps axl
And, after the system times out and sends SIGTERM to all the processes:
> root_at_4fa66da81d02:/# ps axl
> F   UID   PID PPID PRI NI    VSZ   RSS WCHAN STAT TTY        TIME COMMAND
> 4     0     1     0 20   0    176     4 do_wai Ss+ pts/0      0:00 foreground backtick -D 3000 -n S6_KILL_GRA
> 4 65534   232     1 30 10      0     0 -      ZNs ?          0:00 [thttpd] <defunct>
> 4     0   233     1 30 10      0     0 -      ZNs ?          0:00 [exrouter-cpp] <defunct>
> 4     0   279     1 20   0      0     0 -      Z+   pts/0      0:00 [s6-ftrigrd] <defunct>
> 0     0   285     1 20   0    168     4 poll_s S+   pts/0      0:00 s6-sleep -m -- 10000
> 4     0   292     0 20   0   5992 3760 do_wai Ss   pts/1      0:00 bash
> 4     0   300   292 20   0   7568 3080 -      R+   pts/1      0:00 ps axl

You can see:

- The managed processes are "thttpd" and "exrouter"
- I bumped the timeouts to 10000ms for the above tests
- When s6-svscan decides to exit, it sends signals to all the managed
processes, and the s6-supervise processes exit, but the two managed
processes become zombies and aren't reaped
- Timing out still doesn't kill thttpd or exrouter (although it does
kill bash, so I had to reconnect to gather the third "ps axl")
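
As a complement to reading the "ps axl" listings by eye, the same
information (which processes are zombies, and which parent is supposed
to reap them) can be pulled from /proc. This is a rough Python sketch
assuming Linux; parse_stat and list_zombies are hypothetical helper
names.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' and the first '('.
    head, tail = text.rsplit(")", 1)
    pid_str, comm = head.split("(", 1)
    fields = tail.split()
    return int(pid_str), comm, fields[0], int(fields[1])


def list_zombies(proc="/proc"):
    """Return a sorted list of (pid, ppid, comm) for every zombie process."""
    zombies = []
    for name in os.listdir(proc):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc, name, "stat")) as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the process vanished while we were scanning
        if state == "Z":
            zombies.append((pid, ppid, comm))
    return sorted(zombies)
```

In the listings above, the zombies show PPID 1, so whatever is running
as PID 1 at that moment is the only process able to reap them.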

It's easy to cut the timeout to, say, 100ms, but I'd much rather have a
correct shutdown sequence, as that's why I switched to s6 in the first
place.

Any ideas?

Thanks,
Dan

-- 
Daniel T. Griscom
152 Cochrane Street, Melrose, MA 02176-1433
(781) 662-9447  griscom_at_suitable.com  http://www.suitable.com/

Received on Wed Jul 21 2021 - 21:19:26 CEST
