Service Availability Calculation Methods

Piotr Nyczyk
WLCG Service Reliability Workshop
CERN, 26-30 November 2007
[Diagram: SAM tests (OPS) and SAM tests (VO-specific) feed the summarisation module]
• SAM stores raw test results for all VOs in the DB
• GridView implementation of the status and availability calculation algorithm (next slides)
• One list of critical tests per VO
• Efficient (?) calculation working directly on the DB (SQL statements)
• Continuous time scale for status calculation (no sampling)
Old algorithm (not used any more)
• Computes Service Status on a Discrete Time Scale with a precision of an hour
• Test results sampled at the end of each hour
• Service status is based only on the latest results for an hour
• Availability for an hour is always 0 or 1
Drawbacks of the Discrete Time Scale
• A test may pass or fail several times in an hour
• Not possible to represent several events in an hour
• Loss of precise information about the exact time of occurrence of the event
New Algorithm (current)
• Computes Service Status on a Continuous Time Scale
• Service Status is based on all test results
• Computes Availability metrics with higher accuracy
• Conforms to Recommendation 42 of the EGEE-II Review Report about introducing measures of robustness and reliability
• Computes reliability metrics as approved by the LCG MB, 13 Feb 2007
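To make the continuous-time summarisation concrete, here is a minimal sketch (Python, not the actual GridView SQL): the interval data model and the convention that reliability counts only unscheduled periods in the denominator are illustrative assumptions, and the real algorithm's treatment of UNKNOWN may differ.

```python
from dataclasses import dataclass

UP, DOWN, UNKNOWN, SCHED_DOWN = "up", "down", "unknown", "sd"

@dataclass
class Interval:
    start: float   # seconds since epoch
    end: float
    status: str

def summarise(intervals, period_start, period_end):
    """Integrate per-status durations on a continuous time scale."""
    totals = {UP: 0.0, DOWN: 0.0, UNKNOWN: 0.0, SCHED_DOWN: 0.0}
    for iv in intervals:
        # clip each status interval to the reporting period
        start = max(iv.start, period_start)
        end = min(iv.end, period_end)
        if end > start:
            totals[iv.status] += end - start
    total = period_end - period_start
    availability = totals[UP] / total
    # reliability: scheduled downtime (and, in this sketch, UNKNOWN)
    # is excluded from the denominator
    denom = totals[UP] + totals[DOWN]
    reliability = totals[UP] / denom if denom > 0 else 0.0
    return availability, reliability
```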
Major differences between the old and new algorithm
• Service Status computation on a continuous time scale
• Consideration of Scheduled Downtime (SD)
  - a service may pass tests even when in SD
  - this leads to an inaccurate Reliability value
  - the new algorithm ignores test results and marks the status as SD
• Validity of test results (sketched below)
  - 24 hours for all tests in the old case
  - defined by the VO separately for each test in the new method
  - earlier results are invalidated after a scheduled interruption
• Handling of UNKNOWN status
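The sketch below shows how per-test validity, invalidation after a scheduled interruption and the SD override could combine when deriving an instance status at a given time; the data structures and the 24-hour default are illustrative assumptions, not the SAM schema.

```python
def status_at(t, results, validity, sched_down, last_sd_end):
    """results: {test_name: (timestamp, 'ok'|'failed')}
       validity: {test_name: max_age_seconds}  (VO-defined per test)
       sched_down: True if t falls inside a scheduled downtime
       last_sd_end: end time of the most recent scheduled interruption before t"""
    if sched_down:
        return "sd"                      # ignore test results during SD
    statuses = []
    for test, (ts, outcome) in results.items():
        expired = t - ts > validity.get(test, 24 * 3600)  # old default: 24 h
        invalidated = ts < last_sd_end                    # result predates SD end
        if expired or invalidated:
            statuses.append("unknown")
        else:
            statuses.append("up" if outcome == "ok" else "down")
    if "down" in statuses:
        return "down"
    if "unknown" in statuses:
        return "unknown"
    return "up"
```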
[Figure: timeline (hours 0-12, tests t1-t4) comparing the status computed by the old and new algorithms; legend: Up, Down, Unknown, Scheduled Down]
• Two tests failed, but the old algorithm shows status UP because only the final tests in the hour are considered.
• One test failed at the end of the hour, so the old algorithm shows the service DOWN because only the final tests in the hour are considered.
• All results available and tests passed: no difference between the old and new algorithms.
• The service instance is marked as a scheduled shutdown; the service may still pass tests, which results in an inaccurate reliability value in the old algorithm, whereas the new algorithm marks the status as Scheduled Down.
• All earlier test results are invalidated and treated as unknown at the end of a scheduled shutdown in the new algorithm.
• A test result is considered invalid after a VO-specific timeout.
• The result of test4 is unknown; the old algorithm ignores this, whereas the service instance is marked 'unknown' in the new algorithm.
• Coarse granularity in the old algorithm: status can change only at hour boundaries. Finest granularity in the new algorithm: status can change at any moment, is based on all the test results, and is more accurate.
New algorithm: status aggregation
• Consider only the critical tests for a VO.
• Aggregate test statuses into a Service Instance Status, per (service instance, VO), by ANDing:
  - all test statuses are ok → up
  - at least one test status is down (failed) → down
  - no test status down and at least one test status is unknown → unknown
  - marked as scheduled down → sd
• Aggregate service instance statuses into a Service Status, per (site, service, VO), by ORing (a service here is a service type, e.g. CE, SE, sBDII, ...):
  - at least one service instance status up → up
  - no instance up and at least one is sd → sd
  - no instance up or sd and at least one instance is down → down
  - all instances are unknown → unknown
• Aggregate service statuses into a Site Status, per site, by ANDing:
  - all service statuses up → up
  - at least one service status down → down
  - no service down and at least one is sd → sd
  - no service down or sd and at least one is unknown → unknown
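The AND/OR rules above translate directly into code; the following is an illustrative transcription (the production calculation runs as SQL on the SAM DB), with the test-level step shown in the earlier sketch.

```python
def or_aggregate(instance_statuses):
    """Service instance statuses -> service status, per (site, service, VO)."""
    if any(s == "up" for s in instance_statuses):
        return "up"
    if any(s == "sd" for s in instance_statuses):
        return "sd"
    if any(s == "down" for s in instance_statuses):
        return "down"
    return "unknown"          # all instances unknown

def and_aggregate(service_statuses):
    """Service statuses -> site status."""
    if any(s == "down" for s in service_statuses):
        return "down"
    if any(s == "sd" for s in service_statuses):
        return "sd"
    if any(s == "unknown" for s in service_statuses):
        return "unknown"
    return "up"               # all services up

# Example: two CE instances (one down, one up) and one SE instance in SD.
ce = or_aggregate(["down", "up"])     # "up"  - at least one CE instance is up
se = or_aggregate(["sd"])             # "sd"
site = and_aggregate([ce, se])        # "sd"  - no service down, one is sd
```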
• Only one list of critical tests per VO is used for everything:
  - availability metric (reports)
  - operational alarms
  - BDII exclusion using FCR
• Consequence: only one target metric per VO, simply "availability" (plus "reliability")
• Completely uniform: all sites treated in the same way (Tier 0, 1, 2)
• Different criticality targets in several dimensions:
  - metric usage: availability report, alarms, BDII exclusion
  - application domain or functionality: simulation, reconstruction, analysis, ... (VO dependent)
  - service and site categorisation: Tier 0, 1, 2, ... (depending on the computational model)
• More complex status calculation logic:
  - OR expressions on test aggregation (not only a simple AND of critical tests)
  - additional factors (not known at design time)
• "Intelligent tests" - varying results depending on the site role
  - masking failures of Tier-1 related tests on Tier-2 sites
  - additional knowledge and logic needed by the test (where from?)
  - Disadvantages: complex tests, mixing tests with results analysis
• Externally calculated availability
  - get raw test results using the SAM API (last hour/day)
  - calculate customised metrics
  - store them in an own DB (experiments' dashboards)
  - Disadvantages: additional data transfers, storage redundancy (synchronisation?), no feedback to SAM
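A sketch of the "externally calculated availability" workaround, assuming raw results have already been fetched from SAM (the API call itself is omitted); the half-of-critical-tests rule and the test names are invented purely to show a VO-specific metric that differs from the standard AND.

```python
def customised_availability(results, critical_tests):
    """Example VO-specific rule: the site counts as available if at least half
       of its critical tests passed (instead of the standard AND of all)."""
    relevant = [r for r in results if r["test"] in critical_tests]
    if not relevant:
        return None                      # no information -> unknown
    passed = sum(1 for r in relevant if r["status"] == "ok")
    return passed / len(relevant) >= 0.5

# 'sample' would normally come from a SAM API query for the last hour/day.
sample = [
    {"test": "CE-job-submit", "status": "ok"},
    {"test": "SE-put", "status": "failed"},
    {"test": "SE-get", "status": "ok"},
]
print(customised_availability(sample, {"CE-job-submit", "SE-put", "SE-get"}))  # True
```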
Messaging system (MSG) as a monitoring data exchange bus
• all tests published to a common messaging system as metrics (SAM and other monitoring tools)
• everyone can subscribe - a multicast approach to distribute raw results to many repositories
• distributed and decentralised calculation of derived metrics by customised summarisation components
• derived metrics re-published to the messaging system
• SAM/FCR can subscribe to various derived metrics for different purposes:
  - alarm metric
  - availability metric (report)
  - BDII exclusion triggering metric
• Advantage: best flexibility and scalability, possible decentralisation of SAM/GridView
• Challenges: need for robust messaging (additional effort), dealing with message latencies (on-line summarisation, delays)
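A toy, in-process stand-in for the messaging bus (the real MSG would be an external broker, and the topic names here are invented) to illustrate the multicast pattern: raw results are published once, every subscriber gets a copy, and derived metrics are re-published for SAM/FCR to consume.

```python
from collections import defaultdict

class Bus:
    """Minimal publish/subscribe bus; a stand-in for a real message broker."""
    def __init__(self):
        self.subscribers = defaultdict(list)       # topic -> callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for cb in self.subscribers[topic]:          # multicast to all consumers
            cb(message)

bus = Bus()

def summariser(msg):
    # customised summarisation component: derive a metric and re-publish it
    derived = {"site": msg["site"], "metric": "alarm", "value": msg["status"] != "ok"}
    bus.publish("metrics.derived.alarm", derived)

bus.subscribe("metrics.raw", summariser)
bus.subscribe("metrics.raw", lambda m: print("dashboard copy:", m))
bus.subscribe("metrics.derived.alarm", lambda m: print("SAM/FCR sees:", m))

bus.publish("metrics.raw", {"site": "CERN-PROD", "test": "CE-job-submit", "status": "failed"})
```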
Refactored SAM/GridView summarisation algorithm
• stay within the SAM/GridView DB and the GridView summarisation component
• break the current availability metric into many VO-dependent, named metrics
• provide a flexible way (a language) to define metric calculation rules
• expose an interface to define new metrics, but calculate them all in situ
• expose an API to access the calculated metrics
• Advantages: uniform software, standardisation
• Disadvantages: possible loss of flexibility, centralisation, scalability?
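What a "flexible way (language) to define metric calculation rules" might look like is an open design point; one possible sketch is a restricted boolean expression over test names, evaluated in situ (the metric and test names below are invented examples).

```python
import ast

METRICS = {
    # allow OR expressions, not only a simple AND of critical tests
    "atlas_analysis": "CE_job_submit and (SE_put or SE_get)",
    "atlas_alarm":    "CE_job_submit and SE_put and SE_get",
}

def evaluate(metric_name, test_status):
    """test_status: {test_name: True/False for passed/failed}."""
    expr = ast.parse(METRICS[metric_name], mode="eval")
    # only allow bare names and boolean operators in the rule
    for node in ast.walk(expr):
        if not isinstance(node, (ast.Expression, ast.BoolOp, ast.And, ast.Or,
                                 ast.UnaryOp, ast.Not, ast.Name, ast.Load)):
            raise ValueError("operator not allowed in metric rule")
    return eval(compile(expr, "<rule>", "eval"), {"__builtins__": {}}, test_status)

print(evaluate("atlas_analysis",
               {"CE_job_submit": True, "SE_put": False, "SE_get": True}))  # True
```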
SAM/GridView as a monolith (+ current workarounds)
• raw test results published only to the SAM DB
• replicated to remote data stores / metric calculation components using periodic API queries
• incremental updates (new results in the last hour)
• metrics calculated by customised remote summarisation modules
• metric results published back to SAM (how?)
• Advantages: least effort, possible migration to solution 1
• Disadvantages: ineffective, centralised, significant latencies - can be considered only as an intermediate solution (!)
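A sketch of the polling pattern behind this option; fetch_results_since() is a placeholder (no real SAM API signature is implied), and the publish-back step is left as the open question flagged above.

```python
import time

def fetch_results_since(timestamp):
    """Placeholder for a SAM API query returning results newer than timestamp."""
    return []    # the real call would return a list of raw test results

def replicate_and_summarise(local_store, poll_interval=3600):
    """Periodically pull only the new results and update a remote data store."""
    last_seen = time.time() - poll_interval
    while True:
        new_results = fetch_results_since(last_seen)    # incremental update (last hour)
        local_store.extend(new_results)                 # replicate to the remote store
        # ... run the customised summarisation module on local_store here ...
        # ... publish the derived metric back to SAM (mechanism still unclear) ...
        last_seen = time.time()
        time.sleep(poll_interval)
```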
• A breakdown of the "availability metric" into several specific metrics is needed anyway
• A robust and scalable messaging layer allowing multicast publishing is unavoidable in the longer term
• A general understanding of which metrics are really needed and what they represent is crucial (not to get lost in a plenitude of meaningless metrics)
• Decentralisation and distribution of the current SAM/GridView system is probably a good move
• Which solution to choose?
• What exactly are the requirements? (we need to build a list)
• Do we agree on common formats?
  - message format
  - monitoring data exchange / queries
• Will all new availability/reliability metrics become just normal metrics? (as defined by the WLCG Monitoring WG)
• Who will do this?