Automated monitoring triggers

Bob Eldering, Software Engineer JIVE
December 12th, 2011
Contents
1 Introduction
2 Design
2.1 Items to monitor
2.1.1 Experiment items
2.1.2 Static items
2.2 Triggers and dependencies
2.3 Generating warnings
2.4 Screens
3 Implementation
3.1 GUI
3.2 Library
1 Introduction
This document describes a part of an automated monitoring system
called CAIM. The system as a whole is described in
http://www.jive.nl/~jive_cc/sin/sin31.pdf. This document focuses on how
to generate appropriate warning messages from the data monitored.
To be useful, the automated monitoring system will have to minimize
the number of false positives, while still producing timely warnings
when problems do occur.
Another important feature will be to provide a concise message to the
users.
For the discussion in this document I have assumed that we are monitoring an eVLBI experiment. The set of items to monitor for a disk
experiment should more or less be a subset of the items for an eVLBI
experiment.
2 Design
2.1 Items to monitor
There are two types of items we want to monitor:
• Items for a specific experiment. These include weights, fringes,
remote Mark5s and data rates.
• Static items that can always be monitored, independent of the experiment running. These items include local Mark5s, the network
switches and the database server.
2.1.1 Experiment items
These items will have to be configured for each experiment specifically.
The information required to configure them should be available in the
various databases:
• Remote Mark5s: at the start of the job the participating stations
are known, and the CCS database has the IP address of each of these
stations. Using this IP address we can monitor the SSH port, but also,
for example, ping the machine or check that (the correct version of)
the Mark5 software is running.
• Data rates: for each participating station we expect data to arrive
at a certain rate during scans in which they are active. We can
use SNMP or query jive5a to monitor the actual data rates.
The amount of data we expect might vary per station, or even per
scan. It depends on the channels configured to be correlated. This
information is available in the experiment and correlator control
database. Note that connectivity and a track mask might be configured for a specific station. These will reduce the expected data
rate. Connectivity is applied by halving the expected data rate until it is below the configured value. Track mask gives a bit mask
of tracks to send, so the expected data rate is multiplied by the
fraction of 1 bits in the track mask. Connectivity and track mask
cannot both be applied at the same time, even if both are active (>0),
therefore the following rules apply (in order):
– If connectivity and track mask are active, track mask is used.
– If connectivity is active, connectivity is used.
– Otherwise neither is used.
• Weights: similar to data rates, we expect good weights for all configured channels. In the case of weights, a track mask will actually
have to be mapped to channels to see which of the channels are
masked.
• Fringes: when the target source is a calibrator, we can expect
fringes on some baselines. To be able to detect weak fringes, we
will integrate the fringe over the whole scan. On this integrated
fringe we will calculate a signal-to-noise ratio (SNR).
Note that restrictions on the expected channels of the weights also
apply to the fringes. But a fringe involves two stations, so we do
not expect to see a fringe on a certain channel if that channel is
filtered out by connectivity or track mask on either station.
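The data-rate rules above can be sketched as follows. This is only an illustration of the stated rules, not the CAIM code: the function names, the n_tracks parameter and the choice of units are assumptions, and "below the configured value" is read here as strictly below.

```python
def track_mask_fraction(track_mask, n_tracks):
    """Fraction of 1 bits among the lowest n_tracks bits of the mask."""
    mask = track_mask & ((1 << n_tracks) - 1)
    return bin(mask).count("1") / n_tracks


def expected_data_rate(nominal, connectivity=0, track_mask=0, n_tracks=32):
    """Expected data rate for one station (hypothetical helper).

    nominal      -- rate implied by the configured channels
    connectivity -- configured connectivity limit, 0 if inactive
    track_mask   -- bit mask of tracks to send, 0 if inactive
    n_tracks     -- number of tracks the mask covers (assumed value)
    """
    # Rule 1: if both are active, the track mask is used.
    if track_mask > 0:
        return nominal * track_mask_fraction(track_mask, n_tracks)
    # Rule 2: otherwise connectivity, applied by repeated halving
    # until the rate is below the configured value.
    if connectivity > 0:
        rate = nominal
        while rate >= connectivity:
            rate /= 2
        return rate
    # Rule 3: neither is active.
    return nominal
```

For example, a nominal rate of 1024 with a connectivity of 300 is halved twice, giving 256, while a 32-bit track mask with 16 bits set gives 512.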
2.1.2 Static items
Some items can be monitored continuously. Any piece of hardware or
software that always should be running can be included in this list. But
for the first version of the automated monitoring service, the following
should suffice:
• All local Mark5s can be monitored in the same way as the remote
Mark5s: using SSH, ping and possibly a check for running control
software.
• The various network switches used during an eVLBI experiment
can be monitored using SNMP.
• The database server should always be available, so its availability
should be monitored as well.
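As an illustration of what an up/down check on a Mark5 or the database server amounts to, the sketch below tests whether a TCP port (SSH by default) accepts connections. In the actual system such checks would presumably be configured in Zabbix rather than hand-written; the function name and the default timeout are assumptions.

```python
import socket


def tcp_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port can be established.

    This only shows that something is listening on the port; it does
    not log in or verify which software version is running.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

A single failed check is not yet a reason to warn; the result would be fed into a trigger that requires several consecutive failures.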
2.2 Triggers and dependencies
The items described in the previous section will produce series of values
on which we should define triggers to warn the user. These triggers
will be defined by thresholds on the values. We will also require that
the values remain beyond the threshold for a certain duration. The actual
values of these thresholds will require some calibration. We propose the
following starting points for this calibration:
• Local/remote Mark5, network switches and the database: as these
will be up/down statuses, we only need a duration threshold. A
minute seems a reasonable starting point, but this will depend
heavily on the sampling frequency (we should at least have 2 or 3
failed status checks before sending a warning).
• Data rates: If the average data rate over a minute is 10% below or
above the expected data rate we should raise a warning.
• Weights: If the weights drop below 0.9 for more than a minute
we should activate the trigger. We might want to be able to set
this threshold globally or per station through some GUI; possibly
Zabbix itself could suffice for this.
• Fringes: we can actually use the fringe SNRs for multiple triggers.
All of these triggers will use expected SNR values that depend on
the baseline and target source. These expected values are hard to
determine automatically, so they will be configured by the user.
Having the expected SNRs, we can configure triggers on multiple
levels.
– If the SNR is high on LR/RL polarizations, but low on LL/RR
polarizations, we should warn the user of possibly swapped
polarizations.
– All fringes to one station are below their expected value: this
points to a problem with that station.
– All fringes on a certain channel are below their expected value:
this points to a problem with only that channel.
– Specific LL/RR fringes: if certain interferometers, which cannot
be grouped as described above, show lower than expected
SNRs, this might indicate a problem with the correlator.
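The combination of a value threshold and a duration threshold used by the triggers above can be sketched as below. This is a minimal illustration, not how Zabbix evaluates triggers; the class and attribute names are assumptions, and the example uses the proposed starting values for weights (0.9, one minute).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DurationTrigger:
    """Active once values have been bad for longer than min_duration."""
    is_bad: Callable[[float], bool]      # value threshold as a predicate
    min_duration: float = 60.0           # seconds (proposed starting point)
    bad_since: Optional[float] = None    # time of first bad sample in a row

    def update(self, timestamp, value):
        """Feed one sample; return True while the trigger is active."""
        if not self.is_bad(value):
            self.bad_since = None        # a good sample resets the timer
            return False
        if self.bad_since is None:
            self.bad_since = timestamp
        return timestamp - self.bad_since > self.min_duration


# Weights: activate when below 0.9 for more than a minute.
weights_trigger = DurationTrigger(is_bad=lambda w: w < 0.9)
for t, w in [(0, 0.5), (30, 0.5), (60, 0.5), (90, 0.5)]:
    active = weights_trigger.update(t, w)
```

With samples every 30 seconds, the trigger in this example only becomes active at the fourth bad sample. For data rates the predicate would instead compare a one-minute average against the expected rate, for instance abs(rate - expected) > 0.1 * expected.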
The described triggers form a dependency graph: if a certain trigger
activates, we expect all triggers that depend on it to activate as
well. For example, if a remote Mark5 crashes, we also expect the data
rates and weights to drop and the fringes to disappear. However, we
would only like to receive a warning for the remote Mark5, to allow
the user to focus on the root problem.
Said dependency graph is shown in Figure 1. Note that the dependency
relation is transitive.
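Suppressing dependent warnings so that only the root problem is reported could look like the sketch below. The edges in the example are made up for illustration and are not taken from Figure 1; the function name is an assumption as well.

```python
def root_causes(active, depends_on):
    """Return the active triggers not explained by another active trigger.

    depends_on[a] lists the triggers b for which we expect a to
    activate whenever b does. The relation is treated as transitive.
    """
    def explained(trigger, seen):
        for cause in depends_on.get(trigger, ()):
            if cause in seen:
                continue                 # guard against cycles
            seen.add(cause)
            if cause in active or explained(cause, seen):
                return True
        return False

    return {t for t in active if not explained(t, {t})}


# Hypothetical dependency chain, for illustration only.
DEPENDS_ON = {
    "data_rate": ["remote_mark5"],
    "weights": ["data_rate"],
    "fringes": ["weights"],
}

warn = root_causes({"remote_mark5", "data_rate", "weights", "fringes"},
                   DEPENDS_ON)
```

In this example only the remote Mark5 trigger remains, so a single warning is generated for the crash instead of four.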
Figure 1: Dependency graph between items monitored. An arrow from
item A to B means we expect A to trigger if B triggers.
2.3 Generating warnings
For the prototype of the automated monitoring system, we will simply
send an email message to a pre-defined set of addresses. A message will
be generated whenever an issue arises or is resolved. Every message
should also list the issues that are still active.
Even if the warning system takes the dependencies between the triggers
into account, we might still generate a lot of warnings at the same time.
Therefore we should limit the rate at which email is sent to at most
one message per minute. This means that if multiple triggers activate
within a minute, their warnings are gathered into one email message.
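The rate limiting described above can be sketched as a small batcher that collects events and releases at most one message body per minute, always appending the issues that are still active. The class name, the message layout and the flush interface are assumptions; actually sending the email is left to the caller.

```python
class WarningBatcher:
    """Gather trigger events into at most one message per interval."""

    def __init__(self, min_interval=60.0):
        self.min_interval = min_interval   # seconds between messages
        self.last_sent = None              # time of the last message
        self.pending = []                  # events not yet reported
        self.active = set()                # issues currently active

    def event(self, issue, resolved=False):
        """Record that an issue arose or was resolved."""
        if resolved:
            self.active.discard(issue)
            self.pending.append(f"RESOLVED: {issue}")
        else:
            self.active.add(issue)
            self.pending.append(f"NEW: {issue}")

    def flush(self, now):
        """Return a message body to send at time now, or None."""
        if not self.pending:
            return None
        too_soon = (self.last_sent is not None
                    and now - self.last_sent < self.min_interval)
        if too_soon:
            return None                    # keep gathering events
        lines = self.pending + [f"ACTIVE: {i}" for i in sorted(self.active)]
        self.pending = []
        self.last_sent = now
        return "\n".join(lines)
```

Calling flush periodically means that the first warning goes out immediately, while any further events within the same minute are held back and combined into the next message.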
2.4 Screens
Zabbix allows us to quite easily graph any incoming data. We can
gather these graphs in so-called screens. So as a bonus we could create
screens such as one showing a graph of all weights for each station.
Which graphs are useful to group together in a screen should become
clear when the system is in actual use.
3 Implementation
The implementation can be divided into two parts: a GUI to configure
thresholds and a library which will configure Zabbix given a job
identifier. Both parts are implemented in Python, the GUI using Qt for
its widgets.
3.1 GUI
The first implementation allows the user to configure two elements:
• Channels which should be ignored: this will be useful when a
channel is known to be bad, but still scheduled to be correlated.
Otherwise CAIM will generate warnings when there is no problem
that can actually be fixed.
• For which scans and baselines to expect fringes. As discussed in
Section 2.1.1, fringes are usually expected on calibrator sources.
But whether these fringes are strong enough depends on a lot of
variables, such as the source, the position of the source on the sky
and the stations forming the baseline.
These two configuration options are displayed in tabs in the GUI, as
shown in Figure 2.
Figure 2: The two tab views of the GUI
The channels configuration option shows a radio button selection for
stations on the left and checkboxes for channels to ignore for the
selected station on the right.
The calibrators configuration option shows a scan selection widget in the
middle of the screen. Below this selection widget are the four available
configuration options for the selected scans:
• All baselines: simply expect fringes on the baselines formed by
every pair of stations in the job.
• To a certain station: only expect fringes on the baselines where
one of the stations is the station selected in the drop down menu
next to the option.
• European baselines: most of the jobs correlated at JIVE include
stations in Europe with a few stations in other continents. This
option will only expect fringes on the (shorter) European baselines.
• Expect no fringes at all: this is the default option for all scans.
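How the four options translate into a list of baselines can be sketched as follows. The option names, the function signature and the set of European stations passed in are assumptions made for this illustration; the station codes in the example are only sample input.

```python
from itertools import combinations


def expected_baselines(stations, option="none", station=None,
                       european=frozenset()):
    """Return the baselines (station pairs) on which fringes are expected.

    option   -- "all", "station", "european" or "none" (the default)
    station  -- station selected for the "station" option
    european -- stations considered European (assumed to be supplied)
    """
    pairs = list(combinations(sorted(stations), 2))
    if option == "all":
        return pairs
    if option == "station":
        return [p for p in pairs if station in p]
    if option == "european":
        return [p for p in pairs
                if p[0] in european and p[1] in european]
    if option == "none":
        return []
    raise ValueError(f"unknown option: {option!r}")


job_stations = ["Ef", "Wb", "Sh"]
to_sh = expected_baselines(job_stations, "station", station="Sh")
short = expected_baselines(job_stations, "european",
                           european={"Ef", "Wb"})
```

Here to_sh contains the two baselines that include Sh, and short contains only the Ef-Wb baseline.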
The first implementation of the GUI is limited to these four options
instead of a free baseline configuration to keep it simple. For the same
reason we use a constant threshold value for weights and fringes.
Common to the two tab views is the control at the bottom of the screen.
It allows the selection of an experiment to configure and saving the
current configuration.
Note that selecting another experiment will clear all configurations not
saved.
Clicking the ’Save’ button will store the settings in the caim database.
When this is done, the GUI will use the library discussed below to
reconfigure Zabbix for the new configuration.
3.2 Library
The library is implemented as a Python module. It exposes one public
method: configure_job. This method takes a subjob ID and uses it to
retrieve all required data from the experiment, correlator_control and
caim databases. This data includes the GUI configuration as described
in the previous section.
Using this information it will configure Zabbix items as described in
Section 2. The first implementation focuses on detecting errors in correlator data. Therefore the SNMP items are not implemented yet.