Re: perp - how to notify if service suddenly starts dying all the time

From: Jonathan de Boyne Pollard <J.deBoynePollard-newsgroups_at_NTLWorld.com>
Date: Thu, 20 Aug 2015 09:53:24 +0100

Georgi Chorbadzhiyski <georgi.chorbadzhiyski_at_gmail.com>:
> I don't want to overwhelm our admin team with notices on every service
> restart (we are managing thousands of servers). I need a notice only
> if the service restarts more than X times in a minute, which is a sign
> that something is most definitely wrong. I'll have to hack something up.

Been there. Done that. Received the mail deluge. Wrote the Nagios
plug-in. (-:


Nagios isn't the only answer here. But it's one of them. I'm not the
only person to have written such a Nagios plug-in, by a long chalk, as
can be seen. Here are just a few examples skimmed from the WWW:

* http://productionmonkeys.net/guides/qmail-server/addons/nagios-monitoring/check_daemontools_service
* https://github.com/nekoya/nagios-plugins-svstat
* http://www.openfusion.com.au/labs/nagios/

There are a lot of variations on this idea, from those that parse the
output of svstat (which is unwise because it is, as the original
Bernstein manual page said, explicitly human-readable, not
machine-readable) to those that understand the format of a
supervise/status file directly. Mine is the nagios-check-service command
in nosh, employed in Nagios like this:

command[check_services]=/usr/local/bin/system-control nagios-check-service /service/* /service/*/log

The check is not on the number of restarts, but on the length of time
that the service has been in the "running" state, or on whether it is
"stopped", "starting", "stopping" and so forth. (It understands the
daemontools-encore statuses, if they are present.) There's a
command-line option for tuning that. See the manual page for details.
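
To make the idea concrete, here is a rough Python sketch of a "time in
the running state" check. It is not how nagios-check-service is
implemented. It assumes the classic 18-byte daemontools supervise/status
layout (a TAI64N timestamp, then a little-endian PID, a paused flag and a
want flag); daemontools-encore and nosh extend that record, and this
sketch ignores the extra bytes. The names decode_status, classify and
min_uptime are made up for illustration, and treating every stopped
service as CRITICAL is a simplification. Only the exit codes are the
standard Nagios plug-in ones.

```python
import struct
import time

TAI64_EPOCH = 2 ** 62  # TAI64 label corresponding to the start of 1970
TAI_OFFSET = 10        # assumption: fixed offset daemontools adds to time()

# Standard Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def decode_status(raw, now=None):
    """Return (pid, seconds_in_state, want) from a status record.

    Assumes the classic 18-byte layout; any extra bytes are ignored.
    """
    if len(raw) < 18:
        raise ValueError("short status record")
    (tai_seconds,) = struct.unpack(">Q", raw[0:8])  # big-endian seconds
    (pid,) = struct.unpack("<I", raw[12:16])        # little-endian PID
    want = chr(raw[17]) if raw[17] else ""          # 'u', 'd' or nothing
    if now is None:
        now = time.time()
    since = tai_seconds - TAI64_EPOCH - TAI_OFFSET  # approximate Unix time
    return pid, max(0, int(now) - since), want


def classify(pid, seconds_in_state, min_uptime=60):
    """Map a decoded status to a (Nagios exit code, message) pair.

    min_uptime is a hypothetical threshold, in seconds.
    """
    if pid == 0:
        return CRITICAL, "stopped for %ds" % seconds_in_state
    if seconds_in_state < min_uptime:
        return CRITICAL, "running for only %ds" % seconds_in_state
    return OK, "running for %ds" % seconds_in_state
```

A service that keeps dying never accumulates min_uptime seconds in the
running state, so every poll reports CRITICAL without anything having to
count restarts. A wrapper script would read the bytes of
/service/NAME/supervise/status, pass them through these two functions,
print the message and exit with the code.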

It's then up to the sysadmin team to determine how long a succession of
CRITICALs results in a middle-of-the-night alert to the on-duty
operators, which they do by configuring Nagios appropriately.

Incidentally: Because svstat output is human-readable rather than
machine-readable, for the benefit of someone who wanted to process
service status/configuration without knowing the format of a
supervise/status file I also wrote svshow, which outputs the information
in either JSON or .INI format.
Received on Thu Aug 20 2015 - 08:53:24 UTC
