Re: [s6-rc] How to handle longrun failures

From: Laurent Bercot <ska-supervision_at_skarnet.org>
Date: Thu, 02 Mar 2017 23:01:16 +0000

>My daemon does not have readiness notification, so s6-rc considers
>the transition to be successful. I do s6-svc -d . in the finish script,
>so the daemon is not restarted by s6-supervise, but s6-rc lists it as
>"up".
>To get s6-rc back to a coherent state, I need to call "s6-rc -d change
>svc", but I first need to wait that s6-rc has finished its pending
>transition.

  As a temporary workaround, you could for instance set up a 2 second
timeout (a timeout-up file containing 2000, since the value is in
milliseconds), a notification-fd file (containing, say, 3), and change
your run script to something like:

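background
{
   # Give the daemon 1 second to die if it is going to.
   if { sleep 1 }
   # s6-svstat needs a service directory; the run script's cwd is it.
   if { pipeline { s6-svstat . } grep -q ^up }
   # Still up: write a newline to the notification fd.
   fdmove 1 3
   echo
}
fdclose 3
your-real-run-script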

  so that if your daemon is still up after 1 second, the service will be
considered ready, and if it is not, s6-rc will time out after 2 seconds
and consider the transition failed.
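
  A minimal sketch of the two files mentioned above, written into the
service's s6-rc source definition directory. The path ./source/mydaemon
and the SVCDEF variable are placeholders for illustration, not something
s6-rc itself defines:

```shell
# Hypothetical source definition directory for the service; adjust the path.
svcdef="${SVCDEF:-./source/mydaemon}"
mkdir -p "$svcdef"

# Readiness is signalled by writing a newline to this file descriptor.
echo 3 > "$svcdef/notification-fd"

# Give up on the "up" transition after 2000 ms (the value is in milliseconds).
echo 2000 > "$svcdef/timeout-up"
```

  The database then has to be recompiled with s6-rc-compile and made live
with s6-rc-update before the new timeout and notification settings apply.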


>Basically when s6-rc reads 'd' or 'D' on the fifodir it could check
>whether the service is still up or not (s6-svstat). However I am not
>sure if it is acceptable that s6-rc receives such state changes from
>the outside: what should it do with the dependencies then? Bring them
>down?

  s6-rc does nothing with dependencies in a single run. When it sees that
a transition fails, it just marks it as failed, and keeps working on
remaining available transitions. It exits when there is no more work
that it can do without retrying.
  If it exits 1, then some transitions failed. It is then up to the user
to retry - or to perform appropriate actions: "s6-rc -a list" shows the
list of active services, so it's possible to know what should be up but
is not.
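
  One way to do that check from a script, as a rough sketch: feed the
active list to a small filter that asks s6-svstat about each longrun.
The function name report_down and the LIVE variable are made up for this
example, and it assumes the default live directory /run/s6-rc with
service directories under servicedirs/:

```shell
# Assumed location of the s6-rc live directory; override if yours differs.
LIVE="${LIVE:-/run/s6-rc}"

# Read service names on stdin, print those whose process is not up.
report_down() {
  while read -r name; do
    dir="$LIVE/servicedirs/$name"
    # Oneshots have no service directory, so there is no process to check.
    [ -d "$dir" ] || continue
    if ! s6-svstat "$dir" 2>/dev/null | grep -q '^up'; then
      echo "$name"
    fi
  done
}
```

  It would be used as "s6-rc -a list | report_down"; each name it prints
is a service that s6-rc considers up while its process is down, and
could then be brought down with "s6-rc -d change" or retried.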

  It is very much intentional that there are two distinct notions of
state:
  - the state of the process, handled by s6
  - the state of the service, handled by s6-rc
  The process state is temporary, the service state is permanent (until
a new s6-rc invocation, which may or may not change the service state
depending on whether a transition is requested and succeeds).


>Maybe for non-collaborating daemons, up transitions should be
>considered successful only if the daemon stays up for 1 second.

  Yes, and it can be achieved by kludging the run script as above ;)
  I definitely don't want to make it official because it's unreliable
and daemons should ideally provide readiness notification, but it's an
existing possibility for users.

--
  Laurent