Re: interesting claims from Laurent Bercot on 2019-05-16 (supervision)

From: Laurent Bercot <ska-supervision_at_skarnet.org>
Date: Thu, 16 May 2019 08:32:08 +0000

>The Question: As a newbie outsider I wonder, after following the
>discussion of supervision and tasks on stages (1,2,3), that there is a
>restrictive linear progression that prevents reversal. In terms of pid1
>that I may not totally understand, is there a way that an admin can
>reduce the system back to pid1 and restart processes instead of taking
>the system down and restarting? If a glitch is found, usually it is
>corrected and we find it simple to just do a reboot. What if you can
>fix the problem and do it on the fly. The question would be why (or why
>not), and I am not sure I can answer it, but if you theoretically can do
>so, then can you also kill pid2 while pid10 is still running. With my
>limited vision I see stages as one-way check valves in a series of fluid
>linear flow.

I'm not sure I understand your question, but I think there are
really two different questions here; I'll try to reformulate them,
correct me if I'm wrong.

1. Is booting a system a linear process where every step is
reversible?
2. Is it possible to restart a system "from scratch" without
rebooting?

The answer to both questions is "not really, but it doesn't matter".

We've been talking a lot about stages 1, 2 and 3 (and sometimes 4)
lately because I've been working on s6-linux-init, which focuses on
booting and especially on stage 1. But it's a very narrow, very
specific thing to focus on. Stage 1 is a critical part of the booting
process, obviously, and has to be done right, but once it is, you
can basically forget about it.

Most of the machine's lifetime, including most of the booting
sequence, happens in stage 2. Stage 1 is just early preparation, the
very basic minimum things you should be able to assume, such as
"there is a supervision tree running and I can add services to it";
for all intents and purposes, stage 2 is where you will be working,
even if your focus is to bring the machine up, e.g. if you're writing
a service manager.

Stage 1 isn't reversible; once it's done, you never touch it again,
you don't need to "reverse" it. It would be akin to also unloading
the kernel from memory before shutting down - it's just not necessary.

Stage 2 is where things happen. But what happens in stage 2 isn't
really reversible either: there is still a certain amount of one-time
initialization that needs to be done at boot time and doesn't need to
be undone at shutdown time. Booting and shutting down can be made
symmetric up to a point, but never entirely; the most obvious example
is mounting filesystems. There is a point in the boot sequence where
the filesystems are mounted; however, *unmounting* filesystems cannot
be done at the symmetrical point in the shutdown sequence - it has to
be done at the very end of the shutdown sequence, in stage 4, right before
the power goes off. Why? Because during shutdown, you may still have
user processes running that prevent filesystems from being unmounted,
so you can only unmount filesystems after killing everything, which
happens at the end. Whereas during the boot sequence, you don't have
random user processes yet, you have a much more controlled
environment.

Booting and shutting down can't be made 100% symmetric. But that's
not a problem, because *symmetry is not a goal*. The goal of the
boot sequence is to make the machine operational; the goal of the
shutdown sequence is to make sure the plug can be pulled without
causing problems.

Symmetry makes sense in a service manager, because it helps to
see a service as being "up" or "down", and there is a hierarchy of
dependencies between services that make it natural to bring services
"up" or "down" in a certain, reversible order. But service management
isn't all there is, and in the bigger picture, a machine's lifetime
isn't perfectly symmetrical. And that's okay.
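
To illustrate the kind of symmetry a service manager can offer, here is
a minimal sketch in Python. It is not how any particular service manager
is implemented; the service names and the dependency table are made up
for the example. Services come up in dependency order and go down in
exactly the reverse order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical services: each one maps to the set of services it depends on.
DEPENDS_ON = {
    "mount-fs": set(),
    "syslog": {"mount-fs"},
    "network": {"mount-fs"},
    "sshd": {"network", "syslog"},
}


def up_order(deps):
    """Return an order in which every service starts after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def down_order(deps):
    """Stop services in the exact reverse of the order they were started in."""
    return list(reversed(up_order(deps)))
```

Calling up_order(DEPENDS_ON) puts "mount-fs" first and "sshd" last, and
down_order(DEPENDS_ON) is the same list reversed. This covers only the
service-level part; the one-time initialization and the final unmounting
described above do not fit into such a table.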

As for restarting a system from scratch without rebooting, the
question is what you want to achieve.

- If you want to be able to go through the whole shutdown procedure,
bringing down services etc., but *not* the actual hardware reboot,
and then bring the whole system up again from pid 1, yes, it is
theoretically possible, but not particularly useful. The shutdown
procedure is designed to make the system ready for poweroff, and it's
quite a waste if you're not going to poweroff. The boot procedure is
designed to get the system from a just-powered-on state to a fully
operational state, and it's also quite a waste if the system is
already fully operational. There aren't many problems for which
this is the right solution.

- If you want to kill every process but pid 1 and have the system
reconstruct itself from there, then yes, it is possible, and that is
the whole point of having a supervision tree rooted in pid 1. When
you kill every process, the supervision tree respawns, so you always
have a certain set of services running, and the system can always
recover from whatever you throw at it. Try it: grab a machine with
a supervision tree and a root shell, run "kill -9 -1", see what
happens.
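
The respawning behaviour can be sketched in a few lines of Python. This
is only an illustration of the idea, not the code of any real
supervisor: the function name supervise and the max_starts and delay
parameters are assumptions made for the example, and max_starts exists
only so that the sketch can terminate.

```python
import subprocess
import sys
import time


def supervise(argv, max_starts=None, delay=1.0):
    """Run argv and start it again every time it exits.

    max_starts is a hypothetical limit so the sketch can stop;
    with max_starts=None the loop never ends.
    """
    starts = 0
    while max_starts is None or starts < max_starts:
        child = subprocess.Popen(argv)
        starts += 1
        code = child.wait()  # returns when the child exits or is killed
        print(f"child exited with status {code}", file=sys.stderr)
        time.sleep(delay)  # avoid a busy loop if the child dies at once
    return starts
```

For instance, supervise([sys.executable, "-c", "pass"], max_starts=3,
delay=0) starts a trivial child three times and returns 3. In a
supervision tree rooted in pid 1, a loop of this kind is itself watched
by a parent that cannot go away, which is why the tree comes back after
everything else has been killed.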

--
Laurent
