Re: "Back off" setting for crashing services with s6+openrc? from Laurent Bercot on 2022-09-30 (supervision)

From: Laurent Bercot <ska-supervision_at_skarnet.org>
Date: Fri, 30 Sep 2022 13:21:06 +0000

  I feel like this whole thread comes from mismatched expectations of
how s6 should behave.

  s6 always waits for one second between two successive starts of a
service. This ensures it never hogs the CPU by spamming a crashing
service. (With an asterisk, see below.)

  It does not wait for one second if the service has been started for
more than one second. The idea is, if the service has been running for
a while, and dies, you want it up _immediately_, and it's okay because
it was running for a while, and either was somewhat idle (in which case
you're not hurting for resources) or was hot (in which case you don't
want a 1-second delay).
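
  As a rough illustration of the rule above (a sketch, not taken from the
s6-supervise sources), the decision can be modelled as a small shell
function. The name restart_delay_ms is hypothetical, and the sketch
assumes that the wait is whatever remains of the one-second window
since the last start.

```shell
#!/bin/sh
# restart_delay_ms UPTIME_MS
# Prints how many milliseconds to wait before restarting a service that
# died after running for UPTIME_MS milliseconds.
# Assumption: the delay is the remainder of a 1000 ms window that began
# when the service was last started.
restart_delay_ms() {
    uptime_ms=$1
    min_ms=1000
    if [ "$uptime_ms" -ge "$min_ms" ]; then
        echo 0                            # ran long enough: restart at once
    else
        echo $(( min_ms - uptime_ms ))    # crashed early: wait out the window
    fi
}

restart_delay_ms 300     # prints 700
restart_delay_ms 5000    # prints 0
```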

  The point of s6 is to maximize service uptime; this is why it does
not have a configurable backoff mechanism. When you want a service up,
you want it *up*, and that's the #1 priority. A service that keeps
crashing is an abnormal condition and is supposed to be handled by the
admin quickly enough - ideally, before getting to production.

  If the CPU is still being hogged by s6 despite the 1-second delay
and it's not that the service is running hot, then it means it's
crashing while still initializing, and the initialization takes more
than one second while using a lot of resources. In other words, you
got yourself a pretty heavy service that is crashing while starting.

  That should definitely be caught before it goes in production. But
if it's not possible, the least ugly workaround is indeed to sleep
in the finish script, and to increase timeout-finish if needed.
(The "run_service || sleep 10" approach leaves a shell between
s6-supervise and run_service, so it's not good.)
./finish is generally supposed to be very short-lived, because the
"finishing" state is generally confusing to an observer, but in this
case it does not matter: it's an abnormal situation anyway.
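
  The workaround above could be set up along these lines. This is only a
sketch: the SVCDIR variable, the example-service directory and the
10-second delay are hypothetical. It also assumes, as the s6 service
directory documentation describes, that timeout-finish holds a number of
milliseconds, so it is set a little above the sleep duration.

```shell
#!/bin/sh
# Sketch only: install a sleep-in-finish back-off into one service directory.
# SVCDIR and the 10-second delay are hypothetical values for illustration.
svcdir=${SVCDIR:-./example-service}
delay=10                        # seconds to pause after each death

mkdir -p "$svcdir"

# ./finish runs after ./run dies; sleeping here postpones the next start.
cat > "$svcdir/finish" <<EOF
#!/bin/sh
exec sleep $delay
EOF
chmod +x "$svcdir/finish"

# timeout-finish is in milliseconds; leave some headroom so that the
# sleep is not killed before it completes.
echo $(( (delay + 5) * 1000 )) > "$svcdir/timeout-finish"
```

  Unlike "run_service || sleep 10", this keeps the run script free of an
intermediate shell: the pause happens in ./finish, after the service
process is already gone.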

  There is, however, one improvement I think I can safely make.
  Currently, the 1-second delay is computed from when the service
*starts*: if it has been running for more than one second, and crashes,
it restarts immediately, even if it has only been busy initializing,
which causes the resource hog OP is experiencing.
  I could change it to being computed from when the service is *ready*:
if the service dies before being ready, s6-supervise *always* waits for
1 second before restarting. The delay is only skipped if the service has
been *ready* for 1 second or more, which means it is really serving and
either idle (i.e. chill resource-wise) or hot (i.e. you don't want to
delay it).
  Does that sound like a valid improvement to you folks?
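
  To make the difference concrete, here is a hedged sketch comparing the
two rules. Both function names are hypothetical, -1 is an arbitrary
marker for "never became ready", and the sketch assumes the proposed
rule waits a full second whenever the service was not ready for at
least one second. None of this is actual s6-supervise code.

```shell
#!/bin/sh
# current_delay_ms UPTIME_MS
# Current rule: timed from when the service *started*.
# Assumption: the wait is the remainder of the 1000 ms window.
current_delay_ms() {
    if [ "$1" -ge 1000 ]; then echo 0; else echo $(( 1000 - $1 )); fi
}

# proposed_delay_ms READY_MS
# Proposed rule: timed from when the service became *ready*.
# READY_MS is how long the service had been ready when it died,
# or -1 if it never signalled readiness.
proposed_delay_ms() {
    if [ "$1" -ge 1000 ]; then echo 0; else echo 1000; fi
}

# A heavy service that initializes for 3 seconds and crashes before ready:
current_delay_ms 3000    # prints 0    (restarts immediately, hogs resources)
proposed_delay_ms -1     # prints 1000 (always waits one second)
```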

  Note that I don't think making the restart delay configurable is a good
trade-off. It adds complexity, size and failure cases to the s6-supervise
code, it adds another file to a service directory for users to remember,
it adds another avenue for configuration mistakes causing downtime, all
that to save resources for a pathological case. The difference between
0 second and 1 second of free CPU is significant; longer delays have
diminishing returns.

--
  Laurent

Received on Fri Sep 30 2022 - 15:21:06 CEST
