Re: [s6-rc] How to handle longrun failures from Laurent Bercot on 2017-03-02 (supervision)

From: Laurent Bercot <ska-supervision_at_skarnet.org>
Date: Thu, 02 Mar 2017 15:00:27 +0000

>Using s6-rc, I am not sure how to handle longrun failures. Say I have a
>daemon which fails to start (e.g. missing library, cannot read its
>config...). I don't want to start it again.

  It sounds like you don't want to supervise this daemon. In that case,
run it as a oneshot that backgrounds itself, and make sure the parent
exits nonzero if the child doesn't succeed.
  But if you do want to supervise it, keep reading:

> For oneshot transitions the return code determines whether the
>transition is successful or not. For longruns I see the only reason for
>an up transition to fail is a timeout on readiness notification.
>However I do not want to use a timeout in this case. Typically, in the
>finish script of a longrun service, I would like to decide, based on
>the return code or signal number, to put the service down.

  That makes sense, and it's possible to do it at the s6 level (just call
s6-svc -d . in the finish script). However, from the s6-rc point of
view, you have asked a supervised service to transition from down to up,
so it will not stop trying until the service is actually up or it
times out.
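
  For illustration, here is a minimal sketch of such a ./finish script in
plain sh. s6-supervise runs ./finish with the exit code of ./run as the
first argument (256 if ./run was killed by a signal) and the signal
number as the second. The should_stop helper and the particular exit
codes treated as permanent failures (126, 127, 78) are assumptions made
for this example, not something s6 defines; adjust them to what your
daemon actually returns.

```shell
#!/bin/sh
# Hypothetical ./finish script: stop restarting the daemon when the
# failure looks permanent. The choice of codes below is an assumption.

# should_stop EXITCODE SIGNAL
# Returns 0 (true) if the service should be put down, 1 otherwise.
should_stop() {
  case "$1" in
    126|127) return 0 ;;  # command not executable or not found
    78)      return 0 ;;  # assumed "bad configuration" code for this daemon
    *)       return 1 ;;  # anything else, including 256 (killed by a
                          # signal), is treated as transient
  esac
}

# s6-supervise calls ./finish with at least two arguments, and with the
# service directory as the current directory, so "." names this service.
if [ "$#" -ge 2 ] && should_stop "$1" "$2"; then
  s6-svc -d .
fi
```

  With this in place, an exit code of 127 from ./run puts the service
down, while a crash on a signal leaves s6 free to restart it as usual.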

  My advice for now would be to:
1. write your ./finish script with an s6-svc -d when you want to stop
restarting the daemon
2. set a reasonable timeout-up value in your s6-rc definition, so when
the daemon fails and ./finish tells s6 to stop restarting it, the
notification never arrives and s6-rc eventually times out and gives up.
It's kind of ugly, but it's the best you can do for now.
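
  As a rough sketch of step 2, the commands below write an s6-rc source
definition for a longrun. The service name "mydaemon", its --foreground
flag, the 10000 ms timeout and the use of fd 3 for readiness
notification are all hypothetical values picked for the example; the
daemon itself has to support writing a newline to that descriptor when
it is ready. The timeout-up file holds a number of milliseconds.

```shell
#!/bin/sh
# Sketch: create an s6-rc source definition directory for a
# hypothetical longrun named "mydaemon".
set -e

src="${SRC:-$(mktemp -d)}"   # source directory; temporary by default
svc="$src/mydaemon"
mkdir -p "$svc"

echo longrun > "$svc/type"          # service type
echo 3 > "$svc/notification-fd"     # assumed readiness notification fd
echo 10000 > "$svc/timeout-up"      # give up on the up transition after 10 s

# Hypothetical run script; replace with the real daemon command line.
cat > "$svc/run" <<'EOF'
#!/bin/sh
exec mydaemon --foreground
EOF
chmod +x "$svc/run"
```

  A ./finish script would go in the same directory, next to ./run, and
the source directory would then be compiled with s6-rc-compile as usual.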

  I will think about implementing a way for s6 to tell s6-rc to fail a
longrun transition instantly, without waiting for a timeout. It's a good
idea, thanks for mentioning it.

--
  Laurent

Received on Thu Mar 02 2017 - 15:00:27 UTC
