Re: taxonomy of dependencies

From: post-sysv <>
Date: Sun, 07 Jun 2015 21:22:39 -0400

Hm, correct. Such application-interdependency failures are inherent to
schemes where state is observed and handled externally to the program's
own data structures (as is the case with OS process supervision) and
where the programs being monitored are not defensively designed, so
system-wide techniques like superservers, fd-holding and readiness
notification serve, to a degree, as useful tools for synchronization and
consistency. I still remain convinced that crash-only software was a key
insight of Erlang: engineering systems so that restarting to a
known-good state is feasible in and of itself, because components are
discrete and because trying to intervene on internal program state
becomes an exponential nightmare.

Though, when managing OS processes, the semantics differ. Where launchd
is concerned, it obviously supports acting as a socket superserver, and
that much is used. But its plists also have a key called KeepAlive that
can be instructed to act on job state via one of NetworkState,
PathState or OtherJobEnabled. This might have been useful in your
scenario, as it does let services create semaphore conditions for common
events surrounding the environment. NetworkState is defined as any
non-loopback interface having an IP(v4/v6) address assigned to it; the
other two correspond to a hard requirement on a file-system watch handle
and on a registered job, respectively.
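
A minimal sketch of such a plist, for illustration only -- the job
label, program path and watched file here are hypothetical, not anything
shipped by launchd:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
 "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>org.example.worker</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/worker</string>
    </array>
    <key>KeepAlive</key>
    <dict>
        <!-- keep the job running only while some non-loopback
             interface has an IP address assigned -->
        <key>NetworkState</key>
        <true/>
        <!-- ...and only while this (hypothetical) file exists -->
        <key>PathState</key>
        <dict>
            <key>/var/run/worker-ready</key>
            <true/>
        </dict>
    </dict>
</dict>
</plist>
```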

This is a tough issue, but as I stated previously regarding dependency
systems, having a few common semaphores for high-level system events and
using them for synchronization in macro-level service interactions might
be the less complicated approach, avoiding the scheduling and
transactional overheads of full dependency management, which in turn
introduces problems of dependency loops and so on. Though again, you run
into the problem of codifying state, where the state space is complex
and hard to pin down.

On 06/07/2015 08:56 PM, Jonathan de Boyne Pollard wrote:
> post-sysv:
>> Of course, your particular example would be made less gruesome simply
>> by introducing a rate limit on startup failures. This strategy seems
>> to be employed frequently in launchd setups.
> It's a standard daemontools thing, also. It's hardwired into the
> original "supervise".
> I had to get rid of it for nosh; again, because what's fine on a
> hobbyist PC isn't fine in a datacentre. In the daemontools-style
> avoid-restarting-too-often 1 second sleep that ensued whilst dnscache
> was doing a restart (to quickly clear the cache of a bogus DNS
> resource record set), application X processed several hundred
> transaction requests. Unfortunately, since application X was talking
> to dnscache over the loopback interface, the UDP/IP subsystem merrily
> informed the DNS client library that it couldn't reach port 53. (On a
> non-loopback interface, the ICMP messages would return too late.) And
> thus instead of waiting and retransmitting, the DNS client library
> immediately returned a failure to application X for all of that 1
> second's worth of requests.
> Sometimes, one does _not_ want these things. If it's doing a graceful
> restart, I want dnscache back up *right now*, not 1 second from now.
> Application X, whose rate of continual DNS lookups is why there's a
> local dnscache in the first place, needs as close to uninterrupted
> DNS service as it can get, even in the face of system administrators
> who know that "we can just clear that problem out of the local cache
> and get things fixed today by killing the DNS server and letting it
> auto-restart, can't we?" and then terminate the service twice.
> What I have in nosh now is of course that this is user-configurable.
> You want a 1 second sleep? Put "sleep 1" in the "restart" script.
> You don't want one? Don't do that, then. You want to sleep in the
> event of a "bad" signal but restart immediately in the event of normal
> termination or a "good" signal? Use a case statement and the
> parameter passed to restart which encodes the process termination
> status. And so forth.
> And convert-systemd-units can thus write a "restart" script that does
> the range from that to (say) RestartSec=60 and Restart=on-abort, since
> the mechanism is flexible enough.
> But even a restart interval of 1 minute isn't enough to cope with the
> times when it takes rabbitmq-server a fair fraction of an hour to come
> up and the number of waiting clients is in 3 figures. Rate limits are
> a sticking plaster, not the answer.
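
For what it's worth, the case-statement style of restart script
described above might be sketched as below. The argument encoding is an
assumption for illustration -- here a reason word ("exit" or "signal")
plus an exit code or signal name -- and need not match nosh's actual
parameter convention:

```shell
#!/bin/sh
# Sketch of the decision logic in a user-written "restart" script.
# $1: how the process ended ("exit" or "signal") -- hypothetical encoding.
# $2: the exit code or signal name.
decide_restart() {
    case "$1" in
        exit)
            # Normal termination: restart immediately, no back-off.
            echo restart-now
            ;;
        signal)
            case "$2" in
                TERM|INT)
                    # "Good" signals (deliberate stop): restart at once.
                    echo restart-now
                    ;;
                *)
                    # "Bad" signals (crash): sleep first to rate-limit.
                    echo sleep-then-restart
                    ;;
            esac
            ;;
    esac
}

decide_restart exit 0
decide_restart signal SEGV
```

A real script would sleep and exit rather than echo; the echoes just
make the branch taken visible.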
Received on Mon Jun 08 2015 - 01:22:39 UTC

This archive was generated by hypermail 2.3.0 : Sun May 09 2021 - 19:44:19 UTC