Re: s6 bites noob from Laurent Bercot on 2019-02-03 (supervision)

From: Laurent Bercot <ska-supervision_at_skarnet.org>
Date: Sun, 03 Feb 2019 10:19:26 +0000

>s6-supervise aborts on startup if foo/supervise/control is already open, but perpetually retries if foo/run doesn't exist. Both of those problems indicate the user is doing something wrong. Wouldn't it make more sense for both problems to result in the same behavior (either retry or abort, preferably the latter)?

foo/supervise/control being already open indicates there's already a
s6-supervise process monitoring foo - in which case spawning another
one makes no sense, so s6-supervise aborts.

foo/run not existing is a temporary error condition that can happen
at any time, not only at the start of s6-supervise. This is a very
different case: the supervisor is already running and the user is
relying on its monitoring foo. At that point, the supervisor really
should not die, unless explicitly asked to; and "nonexistent foo/run"
is perfectly recoverable: you just have to warn the user and try
again later.

It's simply the difference between a fatal error and a recoverable
error. In most simple programs, all errors can be treated as fatal:
if you're not in the nominal case, just abort and let the user deal
with it. But in a supervisor, the difference is important, because
surviving all kinds of trouble is precisely what a supervisor is
there for.
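
As an illustration only (this is not s6-supervise's actual code), the
distinction can be sketched in Python. The names take_lock,
start_service and AlreadySupervised are hypothetical, and the bounded
"rounds" parameter exists only so the sketch terminates; a real
supervisor would loop until told to stop.

```python
import time


class AlreadySupervised(Exception):
    """Fatal: another supervisor already owns this service directory."""


def supervise(take_lock, start_service, rounds, retry_delay=1.0,
              sleep=time.sleep, warn=print):
    """Make `rounds` start attempts and return how many succeeded.

    take_lock      -- hypothetical callable; raises AlreadySupervised on conflict
    start_service  -- hypothetical callable; raises FileNotFoundError when
                      the run script is missing
    """
    # Fatal case: checked once at startup and deliberately not caught,
    # so the caller sees the error and the supervisor exits.
    take_lock()

    started = 0
    for _ in range(rounds):
        try:
            start_service()
            started += 1
        except FileNotFoundError as e:
            # Recoverable case: it can happen at any time, so warn the
            # user and try again later instead of dying.
            warn(f"warning: unable to start service: {e}; retrying")
            sleep(retry_delay)
    return started
```

The fatal error is allowed to propagate before the loop starts; the
recoverable one is handled inside the loop, so the supervisor stays up.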

>https://cr.yp.to/daemontools/supervise.html indicates the original version of supervise aborts in both cases.

That's what it suggests, but it is unclear ("may exit"). I have
forgotten what daemontools' supervise does when foo/run doesn't
exist, but I don't think it dies. I think it loops, just as
s6-supervise does. You should test it.

> I also don't understand the reason for svscan and supervise being different. Supervise's job is to watch one daemon. Svscan's job is to watch a collection of supervise procs. Why not omit supervise, and have svscan directly watch the daemons? Surely this is a common question.

You said it yourself: supervise's job is to watch one daemon, and
svscan's job is to watch a collection of supervise processes. That is
not the same job at all. And if it's not the same job, a Unix guideline
says they should be different programs: one function = one tool. With
experience, I've found this guideline to be 100% justified, and
extremely useful.
Look at s6-svscan's and s6-supervise's source code. You will find
they share very few library functions - there's basically no code
duplication, no functionality duplication, between them.

Supervising several daemons from one unique process is obviously
possible. That's for instance what perpd, sysvinit and systemd do.
But if you look at perpd's source code (which is functionally and
stylistically the closest to svscan+supervise) you'll see that
it's almost as long as the source code of s6-svscan plus s6-supervise
combined, while not being a perfectly nonblocking state machine as
s6-supervise is.

Combining functionality into a single process adds complexity.
Putting separate functionality in separate processes reduces
complexity, because it takes advantage of the natural boundaries
provided by the OS. It allows you to do just as much with much less
code.

>I understand svscan must be as simple as possible, for reliability, because it must not die. But I don't see how combining it with supervise would really make it more complex. It already has supervise's functionality built in (watch a target proc, and restart it when it dies).

No, the functionality isn't the same at all, and "restart a process
when it dies" is an excessively simplified view of what s6-supervise
does. If that were all there was to it, a "while true ; do ./run ; done"
shell script would do the job; but if you've had to deal with that
approach once in a production environment, you intimately and
painfully know how terrible it is.
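
To make one of those problems concrete: a bare loop restarts a child
that dies instantly as fast as the machine allows, and throws away its
exit status. The Python sketch below adds only those two things, a
minimum interval between starts and a record of exit codes. It is an
assumption-laden illustration, not s6-supervise's logic: the function
name restart_loop, the max_starts bound and the one-second default are
all choices made for the sketch, and it still lacks everything else a
real supervisor provides (control interface, signal handling, state
reporting).

```python
import subprocess
import sys
import time


def restart_loop(argv, max_starts, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Start argv up to max_starts times; return the list of exit codes.

    Unlike a bare shell loop, this never starts the child more often
    than once per min_interval seconds, and it keeps each exit status.
    """
    exit_codes = []
    for _ in range(max_starts):
        began = clock()
        try:
            code = subprocess.call(argv)
        except OSError:
            # The program could not be executed at all (e.g. missing file).
            code = None
        exit_codes.append(code)
        # Throttle: a child that dies instantly must not become a busy loop.
        remaining = min_interval - (clock() - began)
        if remaining > 0:
            sleep(remaining)
    return exit_codes


if __name__ == "__main__":
    # Example: a child that fails immediately is restarted at most once
    # per second, three times in total.
    print(restart_loop([sys.executable, "-c", "raise SystemExit(3)"], 3))
```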

s6-svscan knows how s6-supervise behaves, and can trust it and rely
on an interface between the two programs since they're part of the
same package. Spawning and watching a s6-supervise process is easy,
as easy as calling a function; s6-svscan's complexity comes from the
fact that it needs to manage a *collection* of s6-supervise
processes. (Actually, the brunt of its complexity comes from supporting
pipes between a service and a logger, but that's beside the point.)

On the other hand, s6-supervise does not know how ./run behaves, can
make no assumption about it, cannot trust it, must babysit it no matter
how bad it gets, and must remain stable no matter how much shit gets
thrown at it. This is a totally different job - and a much harder job
than watching a thousand nice, friendly s6-supervise processes.
Part of the proof is that s6-supervise's source code is bigger than
s6-svscan's.

By all means, if you want a single supervisor for all your services,
try perp. It may suit you. But I don't think having fewer processes
in your "ps" output is a worthwhile goal: it's purely cosmetic, and
you have to balance that against the real benefits that separating
processes provides.

--
Laurent

Received on Sun Feb 03 2019 - 10:19:26 UTC
