On 16/07/2015 19:22, Colin Booth wrote:
> You're right, ./run is up, and being in ./finish doesn't count as up.
> At work we use a lot of runit and have a lot more services that do
> cleanup in their ./finish scripts so I'm more used to the runit
> handling of down statuses (up for ./run, finish for ./finish, and down
> for not running). My personal setup, which is pretty much all on s6
> (though migrated from runit), only has informational logging in the
> ./finish scripts so it's rare for my services to ever be in that
> interim state for long enough for anything to notice.
I did some analysis back in the day, and my conclusion was that
admins really wanted to know whether their service was up as opposed
to... not up; and the finish script is clearly "not up". I did not
foresee a situation like a service manager, where you would need to
wait for a "really down" event.
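To make that concrete, here is a minimal sketch with the current tools
(the servicedir path is just a placeholder):

  # Ask s6-supervise to take the service down, then wait for the "down" event.
  s6-svc -d /service/myservice
  s6-svwait -d /service/myservice
  # s6-svwait -d returns as soon as ./run has died;
  # ./finish may still be running at that point.

There is no equivalent for "really down", which is exactly the gap a
service manager runs into.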
> As for notification, maybe 'd' for when ./run dies, and 'D' for when
> ./finish ends. Though since s6-supervise SIGKILLs long-running
> ./finish scripts, it encourages people to do their cleanup elsewhere
> and as such removes the main reason why you'd want to be notified
> when your service is really down. If the s6-supervise timer wasn't
> there, I'd definitely suggest sending some message when ./finish went
> away.
Yes, I've gotten some flak for the decision to put a hard time limit
on ./finish execution, and I'm not 100% convinced it's the right
decision - but I'm almost 100% convinced it's less wrong than just
allowing ./finish to block forever.
./finish is a destroyer, just like close() or free(). It is nigh
impossible to define sensible semantics that allow a destroyer to fail,
because if it does fail, what do you do? void free() is the right
prototype; int close() is a historical mistake.
The same goes for ./finish: nobody tests ./finish's exit code, and
that's okay. But since ./finish is a user-provided script, it has many
more failure modes than just exiting nonzero - in particular, it can
hang (or simply run for ages). The problem is that while it's alive,
the service is still down, and that's not what the admin wants.
Long-running ./finish scripts are almost always a mistake. And that's
why s6-supervise kills ./finish scripts so brutally.
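For the record, the kind of ./finish that stays comfortably under any
reasonable timer is one that only logs and exits - something like this
sketch (assuming the usual convention of s6-supervise passing ./run's
exit code and signal number as arguments):

  #!/bin/sh
  # ./finish: purely informational, no long cleanup.
  # $1 is ./run's exit code (256 if it was killed by a signal),
  # $2 is the signal number in that case.
  echo "run ended: exitcode=$1 signal=$2"
  exit 0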
I think the only satisfactory answer would be to leave it to the user:
keep killing ./finish scripts on a short timer by default, but have
a configuration option to change the timer or remove it entirely. And
with such an option, a "burial notification" when ./finish ends becomes
a possibility.
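If that option took the form of a per-servicedir file - a purely
hypothetical name and format here, milliseconds with 0 meaning no
limit - configuring it could look like:

  # Hypothetical: allow ./finish ten seconds instead of the default
  echo 10000 > /service/myservice/timeout-finish
  # Hypothetical: never kill ./finish
  echo 0 > /service/myservice/timeout-finish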
> Ah, gotcha. I was sending explicit timeout values in my s6-rc commands,
> not using timeout-up and timeout-down files. Assuming -tN is the
> global value, then passing that along definitely makes sense, if
> nothing else to bring its behavior in line with that of
> timeout-up and timeout-down.
Those pesky little s6-svlisten1 processes will get nerfed.
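For clarity, the two ways of expressing a timeout that we are talking
about - the command-line option and the per-service files read at
compile time - would look roughly like this (paths are illustrative):

  # Global timeout on the command line, in milliseconds:
  s6-rc -t 10000 -u change myservice
  # Per-service timeouts, as files in the service's source definition
  # directory (illustrative location), also in milliseconds:
  echo 10000 > /etc/s6-rc/source/myservice/timeout-up
  echo 10000 > /etc/s6-rc/source/myservice/timeout-down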
> Part of my job entails dealing with development servers where
> automatic deploys happen pretty frequently but service definitions
> don't change too often. So having non-privileged access to a subsection
> of the supervision tree is more important than having non-privileged
> access to the pre- and post- compiled offline stuff.
I understand. I guess I can make s6-rc-init and s6-rc 0755 while
keeping them in /sbin, where Joe User isn't supposed to find them.
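Concretely, something along these lines:

  # World-executable, but outside the default PATH of ordinary users:
  chmod 0755 /sbin/s6-rc-init /sbin/s6-rc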
> By the way, that's less secure than running a full non-privileged
> subtree.
Oh, absolutely. It's just that a full setuidgid subtree isn't very
common - but for your use case, a full user service database makes
perfect sense.
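For readers following along, a rough sketch of such a setup - a run
script in the root supervision tree that hands a whole scan directory
to an unprivileged user (user name and paths are placeholders):

  #!/bin/sh
  # ./run for a service in the root tree: drop privileges to user "joe"
  # and supervise his own scan directory.
  exec s6-setuidgid joe s6-svscan /home/joe/service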
--
Laurent