Re: patch: sv check should wait when svrun is not ready from Avery Payne on 2015-02-18 (supervision)

From: Avery Payne <avery.p.payne_at_gmail.com>
Date: Tue, 17 Feb 2015 16:20:03 -0800

On 2/17/2015 11:02 AM, Buck Evan wrote:
> I think there's only three cases here:
>
> 1. Users that would have gotten immediate failure, and no amount of
> spinning would help. These users will see their error delayed by
> $SVWAIT seconds, but no other difference.
> 2. Users that would have gotten immediate failure, but could have
> gotten a success within $SVWAIT seconds. All of these users will of
> course be glad of the change.
> 3. Users that would not have gotten immediate failure. None of these
> users will see the slightest change in behavior.
>
> Do you have a particular scenario in mind when you mention "breaking
> lots of existing installations elsewhere due to a default behavior
> change"? I don't see that there is any case this change would break.

I am not so much thinking of a specific case as I am looking at it from
an integration perspective. I ask that you indulge me for a moment, and
let me diverge from the discussion so I can clarify things.

My background is in maintaining business software. My employer has the
source code to their ERP system and I make large and small modifications
to adapt to changing business needs. During the process of working on
"legacy code" in a "legacy language", I have to be mindful that there
are side-effects to each change; I have to look at it from a viewpoint
of "what is everything else in the system expecting when this code is
called". This means thinking in terms of code-as-API, so that calls
elsewhere don't break. Yes, I am aware of unit tests, etc., but trust
me when I say they are not an option for the environment. So that means
lots and lots of careful testing by hand, and being very mindful of how
things fit together.

With that viewpoint in mind, let's turn back to my words, which were
admittedly overstated. When I said "breaking lots of existing
installations" I was trying to describe a point of view, for I was
looking at it from a pragmatic standpoint of "if there is code out there
that expects behavior X, but is given behavior Y, then the probability
of something breaking increases". From my point of view, when you run
"sv check (something)", that's no different than making an API call
because "sv check (something)" typically happens inside of a script,
which in turn implies a language and environment. The behavior of the
"sv check" call, and specifically the side-effects of it, are taken into
consideration "elsewhere"; I can't say where else because I can't see
specific installations, and it's entirely possible that there is
*nothing* out there that would be broken, and I'm writing this all for
naught. But the point remains - the API is set, the behavior of the
"call" is set, and deviating from that requires that everyone downstream
make changes to ensure that their scripts don't break.

So, I think it becomes a question of "can I guarantee that the side
effect created by the change will not adversely impact something else,
since I can't directly observe what will be impacted?" Which is why I
suggested the option switch. Introducing a new switch means that the
existing behavior will be kept, but we can now use the new behavior by
explicitly asking for it. In effect, we're extending our API without
breaking existing "calls from legacy code".
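
As a rough sketch of the opt-in idea (this is not part of runit; the
check_service name, its --wait flag, and the CHECK_CMD variable are all
made up for illustration), a wrapper could keep the single-attempt
behavior by default and only retry when the caller asks for it:

```shell
#!/bin/sh
# Hypothetical wrapper, for illustration only.
# CHECK_CMD is an assumed override hook; it defaults to "sv check".
: "${CHECK_CMD:=sv check}"

# check_service [--wait SECONDS] SERVICE
# Without --wait: run the check once and return its result.
# With --wait N: retry about once per second, giving up after N retries.
check_service() {
    wait_secs=0
    if [ "$1" = "--wait" ]; then
        wait_secs=$2
        shift 2
    fi
    svc=$1
    elapsed=0
    while :; do
        # Succeed as soon as the check command reports success.
        $CHECK_CMD "$svc" >/dev/null 2>&1 && return 0
        # Out of time (or no waiting requested): fail.
        [ "$elapsed" -ge "$wait_secs" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

Existing callers that write "check_service child-service" see no change;
only callers that add "--wait 5" get the retrying behavior.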

The only example I could give would be my own project at the moment,
although what follows is admittedly a weak argument. Blocking-on-check
would pretty much destroy the script work I've done for peer-based
dependency management, because a single dependency would cause the
"parent" service to hang while it waited for the "child" to come up.
This happens because the use of "sv check (child)" follows the
convention of "check, and either succeed fast or fail fast", and the
parent's script is written with the goal of exiting out because of a
child's failure. Each fail-to-start is logged in the parent, so it's
clear that parent service X failed because child service Y was not
confirmed as running. Without that fast-fail, the logged hint never
occurs; the sysadmin now has to figure out which of three possible
services in a dependency chain is causing the hang. While this is
implemented differently from other installations, there are known cases
similar to what I am doing, where people have ./run scripts like this:

#!/bin/sh
sv check child-service || exit 1
exec parent-service
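
A slightly fuller sketch of the same fail-fast pattern, with the logged
hint described above, might look like the following. The require_deps
helper and the SV_CHECK variable are hypothetical names used only for
this example, and the log wording is invented:

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# SV_CHECK is an assumed override hook; it defaults to "sv check".
: "${SV_CHECK:=sv check}"

# require_deps SERVICE...
# Check each dependency once, in order. On the first one that is not
# confirmed as running, write a hint to stderr and return 1.
require_deps() {
    for dep in "$@"; do
        if ! $SV_CHECK "$dep" >/dev/null 2>&1; then
            echo "dependency $dep not confirmed as running" >&2
            return 1
        fi
    done
    return 0
}
```

A ./run script would then use "require_deps child-service || exit 1"
before "exec parent-service", so the parent's log names the child that
was not up. This only works as intended if the check returns promptly.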

A secondary example would be that the existing set of scripts in the
project are written with an eye towards supporting three environments,
which is possible due to their similar behavior. This consistency makes
the project possible for daemontools and s6, as well as runit. A change
in runit's behavior implies that I can no longer rely on that consistency.

Perhaps I am understanding the environment clearly but misunderstanding
the intent of the change. If I am not grokking your intentions, just
send a short note to the effect of "sorry, wrong idea" and I'll stop. :)
Received on Wed Feb 18 2015 - 00:20:03 UTC