Re: runit kill runsv from Laurent Bercot on 2016-06-23 (supervision)

From: Laurent Bercot <ska-supervision_at_skarnet.org>
Date: Thu, 23 Jun 2016 14:35:14 +0200

On 23/06/2016 03:46, Thomas Lau wrote:
> LOL, well I am trying to do drill test and see how resilience of runit
> could be, this is one of the minor downfall.

  Current supervisors have no way of knowing that they died and
their child is still running. Hence, when they start again, they attempt
to run their child again, which will probably fail since the old instance
of the child is still running. So, they will periodically try and start
the child again, only to fail again, and so on.
  On daemontools and s6, the period is 1 second. I'm not sure about runit,
but it should be around 1s too.
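
  To make the retry behaviour concrete, here is a minimal sketch in Python
of a respawn loop that never starts its child more often than once per
period. It is an illustration only, not the code of daemontools, s6 or
runit; the `supervise` function and its `period` and `max_starts` parameters
are names made up for this example, and `max_starts` exists only so that the
loop can be bounded when experimenting.

```python
import subprocess
import time


def supervise(argv, period=1.0, max_starts=None):
    """Run argv and restart it whenever it exits, at most once per period.

    Hypothetical helper for illustration. Returns the number of starts.
    """
    starts = 0
    while max_starts is None or starts < max_starts:
        started = time.monotonic()
        child = subprocess.Popen(argv)
        starts += 1
        child.wait()  # blocks until this instance of the child dies
        # Sleep off whatever is left of the period, so that a child which
        # fails instantly is retried about once per period instead of in a
        # tight loop.
        remaining = period - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
    return starts
```

  With a period of 1 second, a child that fails as soon as it starts is
retried roughly once per second, which is the failure loop described above.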

  Yes, it is a problem, and I don't like that behaviour much, but the
alternatives are actually worse. Currently, the consequences of the
issue are that when a supervisor dies and restarts:
  - depending on the run script, the daemon's logs are flooded with error
messages from the run script failing to exec into the daemon.
  - Every second, some CPU is used to try and start the daemon.
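
  Why the new instance fails depends on the daemon. One common case, assumed
here for illustration, is a daemon that listens on a fixed TCP port: the old
instance still holds the port, so the new one cannot bind it. The sketch
below reproduces that failure in Python; `try_listen` is a hypothetical
helper, not part of any supervision suite.

```python
import errno
import socket


def try_listen(port, host="127.0.0.1"):
    """Return a listening socket, or None if the address is already in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            # Another process, e.g. the old unsupervised instance, still
            # holds the port: this start attempt fails.
            return None
        raise
```

  A daemon written this way exits almost immediately on the failed bind, so
each retry costs little more than a fork and an exec.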

  I think those drawbacks are acceptable and trying to fix them is not a
good idea:

  - A supervisor dying without its daemon dying is an extremely rare
occurrence, not worth special-casing unless it causes systemic, unrecoverable
failure, which is not the case.
  - What we'd want ideally: the new instance of the supervisor would "grab"
the old instance of the daemon. But that is impossible under Unix, and
any attempt to do that is doomed to use the same hacks that non-supervision
systems use and that supervision aims to step away from.
  - Any attempt to kill the old instance of the daemon in order to properly
start a new supervised instance is a policy decision, which belongs to the
admin; the supervisor program can't make that decision automatically.
  - As is, even if the supervisor dies, the service keeps running; it's in
"degraded mode" because the current instance isn't watched by a supervisor,
but it's still running, and that's what's important. And if the daemon dies,
a new, supervised instance will automatically take its place, as if the
supervisor had never died: things will fix themselves on their own.
  - For critical services, the log flooding should trigger an alerting system
that will notify the admins that there's a problem, and appropriate action
can then be taken (i.e. either do nothing or kill the current instance of
the daemon).
  - The periodic attempt to start a new instance of the daemon is generally
not expensive. This is one of the reasons for the 1s respawning period: it
gives the system time to breathe, without the "respawning too fast" problem
that can be observed with, for instance, sysvinit. If the daemon uses a lot
of resources before it notices it cannot succeed, that's a design issue
in the daemon, not the supervisor; and even in that case, on critical
machines there should be an alerting system that notices the spike in
resource usage and notifies the admins.
  - Attempts to handle that edge case in the supervisor itself would add a lot
(a real whole lot) of complexity, for very uncertain benefits.
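
  The alerting mentioned above can be very simple. As a rough sketch, assuming
that each failed start writes one error line to the log and that a timestamp
is available for every such line, a sliding-window counter is enough to tell
a flood from an isolated failure. The `FloodDetector` class and its default
threshold and window are arbitrary choices for this example, not a
recommendation.

```python
from collections import deque


class FloodDetector:
    """Flag when at least `threshold` error lines fall within `window` seconds."""

    def __init__(self, threshold=30, window=60.0):
        self.threshold = threshold
        self.window = window
        self.seen = deque()  # timestamps of recent error lines, oldest first

    def record(self, timestamp):
        """Record one error line; return True if the log is being flooded."""
        self.seen.append(timestamp)
        # Forget error lines that have fallen out of the window.
        while timestamp - self.seen[0] >= self.window:
            self.seen.popleft()
        return len(self.seen) >= self.threshold
```

  What to do once the detector fires is still up to the admin: do nothing,
or kill the old instance of the daemon so that a supervised one can start.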

  So, yeah. Even if your logs freak out, your memcached is still running,
and that's what you want. And stop voluntarily killing your runsv for
testing purposes: the day when your runsv accidentally dies before the
daemon it's supervising is the day when something's seriously wrong with
your system and you have much bigger problems than spurious log messages.

-- 
  Laurent

Received on Thu Jun 23 2016 - 12:35:14 UTC
