Re: Service watchdog from Petr Malat on 2021-10-21 (supervision)

From: Petr Malat <oss_at_malat.biz>
Date: Thu, 21 Oct 2021 11:20:05 +0200

Hi!

> > Yes, in my usecase this would be used at the place where sd_notify()
> > is used if the service runs under systemd. Then periodically executed
> > watchdog could check the service makes progress and react if it
> > doesn't.
>
> If a single notification step is enough for you, i.e. the service
> goes from a "preparing" state to a "ready" state and remains ready
> until the process dies, then what you want is implemented in the s6
> process supervisor: https://skarnet.org/software/s6/notifywhenup.html
>
> Then you can synchronously wait for service readiness
> (s6-svwait $service) or, if you have a watchdog service, periodically
> poll for readiness (s6-svstat -r $service).
>
> But that's only valid if your service can only change states once
> (from "not ready" to "ready"). If you need anything more complex, s6
> won't support it intrinsically.
No, I need to monitor that the service is alive: my watchdog script
would check whether the status message is older than a defined threshold
and, if so, kill the service (the rest would be handled in the finish
script).
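
A minimal sketch of such a check, under assumptions that are not part
of runit: the service refreshes a heartbeat file (its path is
hypothetical) whenever it makes progress, and the names is_stale,
watchdog_once and kill_service are made up for illustration. A missing
heartbeat file is treated as stale, which is a design choice, not a
requirement.

```python
import os
import time


def is_stale(heartbeat_path, max_age, now=None):
    """Return True if the heartbeat file is older than max_age seconds.

    A missing file counts as stale (assumption: the service creates the
    file as soon as it starts making progress).
    """
    if now is None:
        now = time.time()
    try:
        age = now - os.stat(heartbeat_path).st_mtime
    except FileNotFoundError:
        return True
    return age > max_age


def watchdog_once(heartbeat_path, max_age, kill_service):
    """Run one watchdog pass; call kill_service() if the heartbeat is stale.

    kill_service is a caller-supplied callable, so this sketch does not
    hard-code how the service is actually killed.
    """
    if is_stale(heartbeat_path, max_age):
        kill_service()
        return True
    return False
```

Under runit, kill_service could for example run "sv kill" on the
service directory; restarting and cleanup would then be left to the
run and finish scripts.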

> The reason why there isn't more advanced support for this in any
> supervision suite (save systemd but even there it's pretty minimal)
> is that service states other than "not ready yet" and "ready" are
> very much service-dependent and it's impossible for a generic process
> supervisor to support enough states for every possible existing service.
> Daemons that need complex states usually come with their own
> monitoring software that handles their specific states, with integrated
> health checks etc.
>
> So my advice would be:
> - if what you need is just readiness notification, switch to s6.
> It's very similar to runit and I think you'll find it has other
> benefits as well. The drawback, obviously, is that it's not in busybox
> and the required effort to switch may not be worth it.
> - if you need anything more complex, you can stick to runit, but you
> will kinda need to write your own monitor for your daemon, because
> that's what everyone does.
>
> Depending on the details of the monitoring you need, the monitoring
> software can be implemented as another service (e.g. to receive
> heartbeats from your daemon), or as a polling client (e.g. to do
> periodic health checks). Both approaches are valid.
That's what I thought of as well, but keeping this completely outside of
runsv opens a race window in which the watchdog can kill a service that
has just restarted itself. This could be avoided if the check were
serialized with the other steps (run/finish execution) within runsv. So
far the futile restart of the service doesn't seem to cause me any
problems, so I'm not much bothered by it.
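
For illustration, an external watchdog can narrow this window (but not
close it) by remembering the PID it judged to be stuck and re-reading
the PID file right before sending the signal. This is only a sketch:
the PID file path is whatever the caller passes in (assumed here to be
a file that holds the PID of the currently running service instance),
and read_pid and kill_if_unchanged are hypothetical names.

```python
import os
import signal


def read_pid(pid_file):
    """Return the PID stored in pid_file, or None if missing or not a number."""
    try:
        with open(pid_file) as f:
            text = f.read().strip()
    except FileNotFoundError:
        return None
    return int(text) if text.isdigit() else None


def kill_if_unchanged(pid_file, observed_pid, send_signal=os.kill):
    """Send SIGKILL to observed_pid only if pid_file still names that PID.

    Returns True if the signal was sent. A restart can still happen
    between the re-read and the kill, so the race is reduced, not removed.
    """
    current = read_pid(pid_file)
    if current is None or current != observed_pid:
        return False
    send_signal(observed_pid, signal.SIGKILL)
    return True
```

Only a check performed inside the supervisor itself, serialized with
run/finish, would remove the race entirely.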

> Don't hack on runit, especially the control pipe thing. It will not
> end well.
> (runit's control pipe feature is super dangerous, because it allows a
> service to hijack the control flow of its supervisor, which endangers
> the supervisor's safety. That's why s6 does not implement it; it
> provides similar - albeit slightly less powerful - control features
> via ways that never give the service any power over the supervisor.)
The main reason I wanted to use the service pipe for it was the
possibility of seeing the service status in the process tree, which
would be a nice benefit.

BR,
Petr