Re: s6: something like runit's ./check script from Colin Booth on 2015-09-08 (supervision)

From: Colin Booth <cathexis_at_gmail.com>
Date: Tue, 8 Sep 2015 10:38:17 -0700

On Tue, Sep 8, 2015 at 9:29 AM, Buck Evan <buck_at_yelp.com> wrote:
> Putting them side by side for my own benefit, and normalizing Colin's
> terminology and formatting:
>
> It looks to me like the only notable difference is Colin's 'fdmove -c 2 1'.
> If I understand it, this is redirecting stderr to stdout, which would be out
> of scope for a generalized polling helper. I assume it was written this way
> because he combined the run and check scripts.
>
Exactly. The run script was written and I needed to get it to support
notification for purposes of booting under s6-rc. That's also why I
used loopwhilex - I really need udev functional before other stuff
starts, regardless of how long it takes.
>
> Open questions:
>
> 1) How long should the polling last? The forever-polling seems like it would
> cause ever more orphaned polling processes as the service is restarted. The
> runit precedent is seven seconds with an override. I think a start-timeout
> file would be in line with current s6 design.
>
Forever polling only leaves as many orphaned processes as you fail to
clean up since whatever you have in your check script should exit
(either immediately or when the service starts if it's a blocking
script). There's still only one s6-ftrig-listen spawned by
s6-supervise and that goes away after it sees the U.

With regard to timeout markers, my suggestion is to stick the timeout
in env/ or data/ and fish it out on a per-case basis, though env/ is
better since you can use s6-envdir as it's intended. I'd do it that
way mostly because this is a per-service thing and not part of the
supervisor, but also because it's one less thing that s6-rc needs to
track.
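
As a rough illustration of the per-service timeout idea, here is a
minimal sketch of a polling helper in plain sh. The variable name
START_TIMEOUT and the function name poll_ready are made up for this
example; the assumption is that the run script is started through
s6-envdir so that a file such as env/START_TIMEOUT ends up in the
environment. The 7-second default only mirrors the runit precedent
mentioned above.

```shell
#!/bin/sh
# Poll a readiness check until it succeeds or a deadline passes.
# START_TIMEOUT (seconds) is a hypothetical name, assumed to be exported
# by s6-envdir from the service's env/ directory.
# Usage: poll_ready CHECK_COMMAND [ARGS...]
poll_ready() {
    timeout="${START_TIMEOUT:-7}"   # default mirrors runit's 7 seconds
    waited=0
    # Re-run the check once per second until it exits 0.
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1                # deadline passed: let the caller decide
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A caller would pass whatever check makes sense for the service, for
instance `poll_ready test -S /path/to/socket`, and act on a non-zero
return instead of polling forever.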
>
> 2) What should the poller do on timeout? Laurent's implementation would give
> up quietly, and the service would simply never reach the up-and-ready (U)
> state. I personally find this unacceptable. I'd like svstat to show that the
> service is in a bad state in this case, although I don't know if the concept
> of "bad state" currently exists in s6.
>
Bad state doesn't exist in s6. I believe the state will be left at u
and not U. To be fair, runit does something worse when the check
times out:
colinb_at_colinb1:~/tmp/service/test$ sv start .
timeout: run: .: (pid 21907) 7s
colinb_at_colinb1:~/tmp/service/test$ sv status .
run: .: (pid 21907) 10s

The only thing it does is have `sv start' exit non-zero.

At least with a baked-in timeout inside of your run script you can
have failure events trigger other events (such as mailing or firing
off s6-svc commands). I'm not sure if s6-supervise itself understands
the concept of timeouts but s6-ftrig-wait does, so anything using that
to detect status (s6-svc, s6-rc, custom in-house scripts) should be
able to time out, fail, alert/restart/modify-the-service/etc. as
necessary, in addition to any bookkeeping you do inside the run script
itself. And, honestly, if you expect your service to be in U and it's
still in u after a reasonable chunk of time, that's your indication of
bad state.
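
To make the "failure events trigger other events" point concrete,
here is a small sketch in plain sh. The names wait_or_alert and the
hook argument are hypothetical; the waiting command is whatever you
already use to detect readiness with a timeout (something built on
s6-ftrig-wait, say), and it is only assumed to exit non-zero when it
gives up.

```shell
#!/bin/sh
# Run a waiting command; if it fails (for example by timing out),
# call a hook with the exit status so it can mail, restart, etc.
# Usage: wait_or_alert HOOK WAIT_COMMAND [ARGS...]
wait_or_alert() {
    hook="$1"
    shift
    status=0
    "$@" || status=$?               # remember why the wait failed
    if [ "$status" -ne 0 ]; then
        "$hook" "$status"           # hypothetical hook: alerting, s6-svc, ...
    fi
    return "$status"
}
```

The hook can be a shell function or an external program; it receives
the waiter's exit status as its only argument, and the same status is
passed back to the caller.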

Cheers!

-- 
"If the doors of perception were cleansed every thing would appear to
man as it is, infinite. For man has closed himself up, till he sees
all things thru' narrow chinks of his cavern."
  --  William Blake

Received on Tue Sep 08 2015 - 17:38:17 UTC
