Automated monitoring triggers

Bob Eldering, Software Engineer JIVE
December 12th, 2011

Contents
1 Introduction
2 Design
2.1 Items to monitor
2.1.1 Experiment items
2.1.2 Static items
2.2 Triggers and dependencies
2.3 Generating warnings
2.4 Screens
3 Implementation
3.1 GUI
3.2 Library

1 Introduction

This document describes a part of an automated monitoring system called CAIM. The system as a whole is described in http://www.jive.nl/~jive_cc/sin/sin31.pdf. This document focuses on how to generate appropriate warning messages from the data monitored. To be useful, the automated monitoring system will have to minimize the number of false positives, while still producing timely warnings when problems do occur. Another important feature will be to provide a concise message to the users.

For the discussion in this document I have assumed that we are monitoring an eVLBI experiment. The set of items to monitor for a disk experiment should more or less be a subset of the items for an eVLBI experiment.

2 Design

2.1 Items to monitor

There are two types of items we want to monitor:
• Items for a specific experiment, these include weights, fringes, remote Mark5s and data rates.
• Static items that can always be monitored, independent of the experiment running. These items include local Mark5s, the network switches and the database server.

2.1.1 Experiment items

These items will have to be configured for each experiment specifically.
The information required to configure them should be available in the various databases:
• Remote Mark5s: at the start of the job, the participating stations are known, and the CCS database has the IP address of each of these stations. Using this IP address we can monitor the SSH port, but also, for example, ping the machine or check that (the correct version of) the Mark5 software is running.
• Data rates: for each participating station we expect data to arrive at a certain rate during scans in which it is active. We can use SNMP or query jive5a to monitor the actual data rates. The amount of data we expect might vary per station, or even per scan. It depends on the channels configured to be correlated. This information is available in the experiment and correlator control databases. Note that connectivity and a track mask might be configured for a specific station. These will reduce the expected data rate. Connectivity is applied by halving the expected data rate until it is below the configured value. Track mask gives a bit mask of tracks to send, so the expected data rate is multiplied by the fraction of 1 bits in the track mask. Connectivity and track mask cannot both be applied at the same time, even if both are configured as active (>0). Therefore the following rules apply (in order):
  – If connectivity and track mask are both active, track mask is used.
  – If connectivity is active, connectivity is used.
  – Otherwise neither is used.
• Weights: similar to data rates, we expect good weights for all configured channels. In the case of weights, a track mask will actually have to be mapped to channels to see which of the channels are masked.
• Fringes: when the target source is a calibrator, we can expect fringes on some baselines. To be able to detect weak fringes, we will integrate the fringe over the whole scan. On this integrated fringe we will calculate a signal-to-noise ratio (SNR). Note that restrictions on the expected channels of the weights also apply to the fringes.
But a fringe involves two stations, so we do not expect to see a fringe on a certain channel if that channel is filtered out by connectivity or track mask on either station.

2.1.2 Static items

Some items can be monitored continuously. Any piece of hardware or software that should always be running can be included in this list. For the first version of the automated monitoring service, however, the following should suffice:
• All local Mark5s can be monitored in the same way as the remote Mark5s: using SSH, ping and possibly a check for running control software.
• The various network switches used during an eVLBI experiment can be monitored using SNMP.
• The database server should always be available, so its availability should be monitored.

2.2 Triggers and dependencies

The items described in the previous section will produce series of values on which we should define triggers to warn the user. These triggers will be defined by thresholds on the values. We will also require that the values violate the threshold for a certain duration. The actual values of these thresholds will require some calibration. We propose the following starting points for this calibration:
• Local/remote Mark5s, network switches and the database: as these will be up/down statuses, we only need a duration threshold. A minute seems a reasonable starting point, but this will depend heavily on the sampling frequency (we should have at least 2 or 3 failed status checks before sending a warning).
• Data rates: if the average data rate over a minute is more than 10% below or above the expected data rate, we should raise a warning.
• Weights: if the weights drop below 0.9 for more than a minute, we should activate the trigger. We might want to be able to set this threshold globally or per station through some GUI; possibly Zabbix itself could suffice as the GUI.
• Fringes: we can actually use the fringe SNRs for multiple triggers. All of these triggers will use expected SNR values that depend on the baseline and target source.
These expected values are hard to determine automatically, so they will be configured by the user. Having the expected SNRs, we can configure triggers on multiple levels.
  – If the SNR is high on LR/RL polarizations, but low on LL/RR polarizations, we should warn the user of possibly swapped polarizations.
  – All fringes to one station are below their expected value: this points to a problem with that station.
  – All fringes on a certain channel are below their expected value: this points to a problem with only that channel.
  – Specific LL/RR fringes: if certain interferometers, which cannot be grouped as described above, show lower than expected SNRs, this might indicate a problem with the correlator.

We can say that the described triggers form a dependency graph, meaning that if a certain trigger activates, we expect all triggers linked to this trigger to also activate. For example, if a remote Mark5 crashes, we also expect the data rates and weights to drop and the fringes to disappear. However, we would only like to receive a warning for the remote Mark5, to allow the user to focus on the root problem. This dependency graph is shown in Figure 1. Note that the dependency relation is transitive.

Figure 1: Dependency graph between items monitored. An arrow from item A to B means we expect A to trigger if B triggers.

2.3 Generating warnings

For the prototype of the automated monitoring system, we will simply send an email message to a pre-defined set of addresses. A message will be generated whenever an issue arises or is resolved. Any message should also list the issues that are still active.

Even if the warning system takes the dependency between the triggers into account, we might still generate a lot of warnings at the same time. Therefore we should rate limit the email being sent to at most one message per minute. This means that if multiple triggers activate within a minute, their warnings are gathered into one email message.
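The suppression of dependent warnings described in Section 2.2 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the CAIM implementation: the function names `upstream` and `root_causes`, the trigger names and the edges of `EXAMPLE_GRAPH` are all hypothetical (the real edges are those of Figure 1). The sketch follows the arrows transitively and keeps only the active triggers that are not explained by another active trigger.

```python
def upstream(trigger, depends_on):
    """Return every trigger reachable from `trigger` by following arrows.

    depends_on[a] is the set of triggers b with an arrow a -> b, meaning
    we expect a to activate whenever b activates. Following the arrows
    repeatedly accounts for the transitivity of the relation.
    """
    seen = set()
    stack = list(depends_on.get(trigger, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, ()))
    seen.discard(trigger)  # guard against an accidental cycle in the graph
    return seen


def root_causes(active, depends_on):
    """Keep only the active triggers that no other active trigger explains."""
    active = set(active)
    return {t for t in active if not (upstream(t, depends_on) & active)}


# Hypothetical example graph (assumption, not taken from Figure 1):
# a crashed remote Mark5 stops the data, which lowers the weights,
# which in turn makes the fringes disappear.
EXAMPLE_GRAPH = {
    "data_rate": {"remote_mark5"},
    "weights": {"data_rate"},
    "fringes": {"weights"},
}

if __name__ == "__main__":
    active = {"remote_mark5", "data_rate", "weights", "fringes"}
    # Only the remote Mark5 should be reported to the user.
    print(sorted(root_causes(active, EXAMPLE_GRAPH)))
```

With all four example triggers active, only `remote_mark5` is reported. If only the weights and fringes triggers are active, the weights trigger is reported as the root problem.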
2.4 Screens

Zabbix allows us to quite easily graph any incoming data. We can gather these graphs in so-called screens. So as a bonus we could create screens such as a graph with all weights for each station. Which graphs are useful to group together into a screen should become clear when the system is in actual use.

3 Implementation

The implementation can be divided into two parts: a GUI to configure thresholds and a library which will configure Zabbix given a job identifier. Both parts are implemented in Python, the GUI using Qt for its widgets.

3.1 GUI

The first implementation allows the user to configure two elements:
• Channels which should be ignored: this will be useful when a channel is known to be bad, but still scheduled to be correlated. Otherwise CAIM will generate warnings when there is no problem that can actually be fixed.
• For which scans and baselines to expect fringes. As discussed in Section 2.1.1, fringes are usually expected on calibrator sources. But whether these fringes are strong enough depends on a lot of variables, such as the source, the position of the source on the sky and the stations forming the baseline.

These two configuration options are displayed in tabs in the GUI, as shown in Figure 2.

Figure 2: The two tab views of the GUI

The channels configuration option shows a radio button selection for stations on the left and checkboxes for channels to ignore for the selected station on the right.

The calibrators configuration option shows a scan selection widget in the middle of the screen. Below this selection widget are the four available configuration options for the selected scans:
• All baselines: simply expect fringes on the baselines formed by every pair of stations in the job.
• To a certain station: only expect fringes on the baselines where one of the stations is the station selected in the drop-down menu next to the option.
• European baselines: most of the jobs correlated at JIVE include stations in Europe with a few stations in other continents. This option will only expect fringes on the (shorter) European baselines.
• Expect no fringes at all: this is the default option for all scans.

The first implementation of the GUI is limited to these four options instead of a free baseline configuration to keep it simple. For the same reason we use a constant threshold value for weights and fringes.

Common to the two tab views is the control at the bottom of the screen. It allows the selection of an experiment to configure and saving the current configuration. Note that selecting another experiment will clear all configurations not saved. Clicking the 'Save' button will store the settings in the caim database. When this is done, the GUI will use the library discussed below to reconfigure Zabbix for the new configuration.

3.2 Library

The library is implemented as a Python module. It exposes one public method: configure_job. This method takes a subjob ID to retrieve all data required from the experiment, correlator_control and caim databases. This data includes the GUI configuration as described in the previous section. Using this information it will configure Zabbix items as described in Section 2. The first implementation focuses on detecting errors in correlator data. Therefore the SNMP items are not implemented yet.
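As an illustration of the kind of computation needed when configuring data rate items, the connectivity and track mask rules of Section 2.1.1 and the 10% trigger of Section 2.2 can be sketched as below. This is a hedged sketch, not the library's code: the function names, the `n_tracks` parameter (the width of the track mask, assumed to be 32 here) and the rate unit are assumptions. "Below the configured value" is read strictly, which the text does not specify.

```python
def expected_data_rate(nominal_rate, connectivity=0, track_mask=0, n_tracks=32):
    """Expected data rate of one station after connectivity / track mask.

    nominal_rate: rate implied by the channels configured to be correlated.
    connectivity: configured limit, 0 means inactive.
    track_mask:   bit mask of tracks to send, 0 means inactive.
    n_tracks:     width of the track mask in bits (assumed, hypothetical).
    """
    if track_mask > 0:
        # Rule 1: track mask wins, also when connectivity is active too.
        ones = bin(track_mask & ((1 << n_tracks) - 1)).count("1")
        return nominal_rate * ones / n_tracks
    if connectivity > 0:
        # Rule 2: halve until the rate is (strictly) below the limit.
        rate = nominal_rate
        while rate >= connectivity:
            rate /= 2
        return rate
    # Rule 3: neither is active.
    return nominal_rate


def rate_out_of_range(measured, expected, tolerance=0.10):
    """True if the averaged measured rate deviates more than 10% from expected."""
    return abs(measured - expected) > tolerance * expected


if __name__ == "__main__":
    expected = expected_data_rate(1024.0, connectivity=600)
    print(expected, rate_out_of_range(450.0, expected))
```

The averaging over one minute and the duration requirement are left out of the sketch; in the design above they are handled by the trigger definition rather than by this calculation.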