Re: Kill process group after a timeout window

From: Hoël Bézier <hoelbezier_at_riseup.net>
Date: Wed, 5 Nov 2025 18:17:34 +0100

Hi,

On Wed, Nov 05, 2025 at 01:32:47 +0000, Becker, Tom via supervision wrote:
>Hi all,
>
>I recently encountered an issue while using s6-svc to terminate a service during a restart procedure. I would like to ask for your help in case I have misconceptions about the intended s6 usage or if you have suggestions on how to resolve the issue.
>
>The service is a worker node in a cluster of compute nodes. It runs in a docker container with s6 as the init script.
>Occasionally the node will reach an invalid state such as when it can no longer connect to the rest of the cluster or when the computation encounters OOMs.
>We use a python script to poll for these invalid states and call the restart procedure.
>The body of the restart procedure looks like this (error handling is omitted):
>
>from os import system
>
>def restart_daemon():
>      # ask the daemon to stop gracefully, waiting up to 180 s for it to go down
>      system("/opt/s6/bin/s6-svc -T 180000 -wd -d /projectname/service/servicename")
>      # kill the daemon if it is still alive
>      system("/opt/s6/bin/s6-svc -k /projectname/service/servicename")
>      # start the daemon again, waiting up to 180 s for it to come up
>      system("/opt/s6/bin/s6-svc -T 180000 -wu -u /projectname/service/servicename")
>
>The idea is to give the daemon a chance to end itself gracefully (potentially write out some logs and log out of the cluster) before assuredly killing it. After killing it a new daemon is started. This usually works fine.
>We encountered an issue where the daemon would repeatedly try to restart but wasn’t able to. It turns out the daemon had left behind a stray child process which was messing up its starting procedure. This was a rare occurrence but is not unheard of.
>I upgraded s6 to the most recent version 2.13.2.0, which supports the “-K” option (for kill the whole process group) and changed the corresponding line in the function.
>To test the change, I replaced the daemon with a process that does nothing but spawn a child process, which in turn does nothing except sleep endlessly.
>When executing the s6-svc commands in sequence the daemon’s child did not, however, get killed. It survived and with every restart a new one was made, which proves that this solution does not resolve the issue.
>I haven’t looked into the s6 code yet, but I am guessing the reason is that if the daemon manages to shut down gracefully (which is normally the case), then by the time the process group is ordered to be killed, s6 no longer remembers the process group that the service used to have.
>I considered storing the process group of the daemon in a file somewhere and sending the signal from the monitoring script, but that comes with an array of problems that s6 was designed to prevent in the first place. I believe the issue is that s6-svc does not come with an option that combines the graceful and graceless kill commands. Ideally we could send SIGTERM with a timeout and then send SIGKILL to the whole process group in one operation, so that the process group is not forgotten. Do you have any suggestions?

Well, I believe that the issue you have is that your daemon does not kill its
child when exiting gracefully. :p

This looks like a bug in your daemon, since the leftover child prevents it from
starting again.

To fix this, you can write a ./finish script in your service directory that
sends a SIGKILL to the process group (if the daemon exited gracefully, we
should be able to consider that any process left in this process group is buggy
and can be safely killed without mercy). You can read the process group in the
fourth argument given to your ./finish script.

You can read more about the ./finish script on this page[0].

0: https://skarnet.org/software/s6/servicedir.html
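
To make the suggestion concrete, here is a rough sketch of such a ./finish
script. It is written in Python only because your monitoring script already is;
a shell or execline script would do just as well. It assumes, as described on
the page above, that the process group id of the dead run script arrives as the
fourth argument. The function name kill_stray_group is made up for this
example, and the refusal to signal the script's own process group is purely a
defensive guess on my part, not something s6 requires.

```python
#!/usr/bin/env python3
"""Sketch of a ./finish script that kills leftovers of the dead service."""
import os
import signal
import sys


def kill_stray_group(pgid: int) -> bool:
    """Send SIGKILL to every process still in process group `pgid`.

    Returns True if the signal was sent, False if nothing was done.
    (Hypothetical helper, not part of s6.)
    """
    # Defensive assumption: never signal group 0 or 1, or our own group.
    if pgid <= 1 or pgid == os.getpgrp():
        return False
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        # No process left in that group: the daemon cleaned up after itself.
        return False
    return True


# Arguments expected from s6-supervise: exit code, signal number,
# service directory name, process group id of the dead run script.
if __name__ == "__main__" and len(sys.argv) > 4:
    kill_stray_group(int(sys.argv[4]))
```

Saved as ./finish in the service directory and made executable, this should run
each time the daemon dies, before it is started again, so the stray child from
your test would be gone by the time the new instance comes up.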

Ideally of course, you’d fix your daemon, but maybe you’re not the one writing
it, so this seems to me like the best workaround you have. :)

I don’t think the solution you’re proposing would work, because s6-supervise
has no knowledge that your daemon has children. If we could do what you asked,
it would send SIGTERM to your daemon, your daemon would exit gracefully,
leaving a child behind, and s6-supervise would never send SIGKILL to the
process group because it thinks the daemon exited gracefully (which it did).

Sending SIGTERM to the whole process group at once would not be appropriate
either, because your daemon’s children might not exit normally on SIGTERM and
may need to be notified differently (SIGHUP if they’re reading from a pipe, for
instance, or some other IPC mechanism used by your daemon for communicating with
them). s6-supervise cannot know that; your daemon is the only one that can,
which is why I was speaking about fixing it earlier.

However, once your daemon has exited gracefully, we can safely assume that no
process should be left in its process group, and kill any that remain. This is
why I suggested doing that, and with s6 the proper place for it is the ./finish
script.

Good luck with your misbehaving daemon. :)
Hoël