Re: [s6-rc] How to handle longrun failures from Van Bemten, Lionel (Nokia

From: Van Bemten, Lionel (Nokia - BE)
Date: Thu, 2 Mar 2017 16:24:37 +0000

I do want to supervise it, e.g. to restart it when it dies from SIGSEGV,
but not when a library is missing.

My daemon does not have readiness notification, so s6-rc considers
the transition to be successful. I do s6-svc -d . in the finish script, so
the daemon is not restarted by s6-supervise, but s6-rc lists it as "up".
To get s6-rc back to a coherent state, I need to call "s6-rc -d change svc",
but first I need to wait until s6-rc has finished its pending transition.

Basically when s6-rc reads 'd' or 'D' on the fifodir it could check whether
the service is still up or not (s6-svstat). However I am not sure if it is
acceptable that s6-rc receives such state changes from the outside:
what should it do with the dependencies then? Bring them down?
For applications that collaborate (i.e. with readiness notification) you
can probably do that, because depending services are not yet started,
but for others, it seems hazardous. Maybe for non-collaborating
daemons, up transitions should be considered successful only if the
daemon stays up for 1 second. Sounds awful at first but thinking about
it, it may not be such a bad idea...

Kr,
Lionel
________________________________________
From: supervision_at_list.skarnet.org <supervision_at_list.skarnet.org> on behalf of Laurent Bercot <ska-supervision_at_skarnet.org>
Sent: Thursday, March 2, 2017 4:00:27 PM
To: supervision_at_list.skarnet.org
Subject: Re: [s6-rc] How to handle longrun failures

>Using s6-rc, I am not sure how to handle longrun failures. Say I have a
>daemon which fails to start (e.g. missing library, cannot read its
>config...). I don't want to start it again.

  It sounds like you don't want to supervise this daemon. In that case,
run it as a oneshot that backgrounds itself, and make sure the parent
exits nonzero if the child doesn't succeed.
  But if you do want to supervise it, keep reading:

> For oneshot transitions the return code determines whether the
>transition is successful or not. For longruns I see the only reason for
>an up transition to fail is a timeout on readiness notification.
>However I do not want to use a timeout in this case. Typically, in the
>finish script of a longrun service, I would like to decide, based on
>the return code or signal number, to put the service down.

  That makes sense, and it's possible to do it at the s6 level (just call
s6-svc -d . in the finish script). However, from the s6-rc point of
view, you have asked a supervised service to transition from down to up,
so it will not stop trying until the service is actually up or it
times out.

  My advice for now would be to:
1. write your ./finish script with a s6-svc -d when you want to stop
restarting the daemon
2. set a reasonable timeout-up value in your s6-rc definition, so when
the daemon fails and ./finish tells s6 to stop restarting it, the
notification never arrives and s6-rc eventually times out and gives up.
It's kind of ugly, but it's the best you can do for now.
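
A minimal sketch of such a ./finish script, written in plain sh rather than
execline. It relies on s6-supervise passing the exit code of ./run as the
first argument (256 if ./run was killed by a signal) and the signal number
as the second. The helper name should_stay_down and the policy it encodes
(restart after a signal or a clean exit, stay down after any other exit
code) are assumptions for illustration, not something prescribed in this
thread.

```shell
#!/bin/sh
# Sketch of a ./finish script. s6-supervise runs it with:
#   $1 = exit code of ./run (256 if ./run was killed by a signal)
#   $2 = number of the signal that killed ./run (meaningful only if $1 is 256)

# Hypothetical policy helper: succeeds when the daemon should stay down.
should_stay_down() {
  exit_code=$1
  # 256 means death by signal (e.g. SIGSEGV): let s6-supervise restart it.
  [ "$exit_code" -eq 256 ] && return 1
  # A clean exit is also restarted here (an assumption; adjust as needed).
  [ "$exit_code" -eq 0 ] && return 1
  # Any other exit code is treated as a startup failure (missing library,
  # unreadable config, ...): do not restart.
  return 0
}

if should_stay_down "${1:-0}"; then
  # Tell s6-supervise to keep the service in the current directory down.
  s6-svc -d .
fi
```

For step 2, the timeout goes in a file named timeout-up in the service's
s6-rc source definition directory, containing a number of milliseconds; the
right value depends on how long the daemon normally takes to become ready.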

  I will think about implementing a way for s6 to tell s6-rc to fail a
longrun transition instantly, without waiting for a timeout. It's a good
idea, thanks for mentioning it.

--
  Laurent

Received on Thu Mar 02 2017 - 16:24:37 UTC