Re: taxonomy of dependencies from Laurent Bercot on 2015-06-08 (supervision)

From: Laurent Bercot <ska-supervision_at_skarnet.org>
Date: Mon, 08 Jun 2015 11:20:58 +0200

On 08/06/2015 02:56, Jonathan de Boyne Pollard wrote:
> I had to get rid of it for nosh; again, because what's fine on a
> hobbyist PC isn't fine in a datacentre. In the daemontools-style
> avoid-restarting-too-often 1 second sleep that ensued whilst dnscache
> was doing a restart (to quickly clear the cache of a bogus DNS
> resource record set), application X processed several hundred
> transaction requests. Unfortunately, since application X was talking
> to dnscache over the loopback interface, the UDP/IP subsystem merrily
> informed the DNS client library that it couldn't reach port 53. (On
> a non-loopback interface, the ICMP messages would return too late.)
> And thus instead of waiting and retransmitting, the DNS client
> library immediately returned a failure to application X for all of
> that 1 second's worth of requests.

  Sorry Jonathan, but this is just screaming unreliable design. If your
dnscache is so critical that you can't afford having it down for one
second, then why isn't there a backup? Why don't you have several
"nameserver" lines in your /etc/resolv.conf, so that when you restart one
of them, queries are still served by the other ones?
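
  To illustrate the failover idea, here is a minimal sketch, not the
behaviour of any particular resolver library. The names
resolve_with_failover and send_query are hypothetical; send_query stands
in for whatever actually sends the DNS query, and is assumed to raise
OSError (for instance ConnectionRefusedError on loopback when nothing
listens on port 53) when a server is unreachable.

```python
def resolve_with_failover(servers, query, send_query):
    """Try each server in order and return the first answer obtained.

    send_query(server, query) is a hypothetical callable that returns an
    answer, or raises OSError when the server cannot be reached.
    """
    last_error = None
    for server in servers:
        try:
            return send_query(server, query)
        except OSError as exc:
            # This server is down or restarting: remember why, then move
            # on to the next one instead of failing the lookup outright.
            last_error = exc
    if last_error is None:
        raise LookupError("no servers configured")
    raise last_error
```

  With two servers listed, a restart of the first one costs a refused
attempt per lookup, not a failed lookup.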

  In datacenters, you do not ensure continuity of service by minimizing
process downtime (although this is of course a valuable secondary goal).
You ensure continuity of service by making sure it is not a problem at
all when a process goes down, and you give yourself a reasonable margin
of downtime for every process, which will help for outages as well as
rollouts.

  What is true for datacenters even more than for hobbyist PCs, however,
is that you definitely do not want cascading failure. And instant restart
is a recipe for cascading failure: if your dnscache cannot start for
some reason and dies instantly, and your supervisor restarts it
immediately, your CPU loses itself in that loop and now you have a whole
machine down instead of just one process down.
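
  The restart delay can be sketched as follows. This is an illustration
under assumptions, not the code of daemontools, s6 or nosh: the names
restart_delay and supervise, and the max_runs parameter, are made up for
the example. It spaces successive starts at least one second apart, so a
service that dies instantly costs one fork per second instead of a busy
loop.

```python
import subprocess
import time

MIN_RESTART_INTERVAL = 1.0  # seconds between two starts (assumed value)


def restart_delay(started_at, now, min_interval=MIN_RESTART_INTERVAL):
    """Return how long to wait before starting the service again."""
    elapsed = now - started_at
    # A service that ran longer than the interval restarts immediately;
    # one that died at once waits out the rest of the interval.
    return max(0.0, min_interval - elapsed)


def supervise(argv, max_runs=None):
    """Run argv repeatedly, never starting it more often than the interval.

    max_runs is a hypothetical knob to bound the loop for experiments;
    a real supervisor would loop until told to stop.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        started_at = time.monotonic()
        subprocess.call(argv)  # blocks until the service exits
        runs += 1
        time.sleep(restart_delay(started_at, time.monotonic()))
```

  Note that a service which has been up for a while is restarted
without delay here; only the crash-at-startup loop is throttled.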

> Sometimes, one does _not_ want these things. If it's doing a
> graceful restart, I want dnscache back up *right now*, not 1 second
> from now.

  Not if it comes at the price of risking a cascading failure in some
cases, no you don't.

> Application X, whose rate of continual DNS lookups is why
> there's a local dnscache in the first place, needs as close to
> uninterrupted DNS service as it can get, even in the face of system
> administrators who know that "we can just clear that problem out of
> the local cache and get things fixed today by killing the DNS server
> and letting it auto-restart, can't we?" and then terminate the
> service twice.

  Sysadmins *should* be able to make that assumption, and even if the
restart is delayed by one second (zomg one second of downtime for one
process), they should never hesitate to go for the easy fix.

  I've been an SRE. Trust me, when you're an SRE, you *want* the easy
fixes. You need all your brain power to address the complex issues
without being bothered by something as trivial as a bogus cache
entry. And you also do *not* want to risk a cascading failure every
time you restart a freakin' cache.

  If your process is mission-critical, have more than one instance,
end of story. One second of downtime on one of your processes should
not be visible to the end users, *ever*.

-- 
  Laurent

Received on Mon Jun 08 2015 - 09:20:58 UTC
