Statistical Data Reduction for Efficient Application Performance Monitoring
Lingyun Yang, Jennifer M. Schopf,
Catalin L. Dumitrescu, Ian Foster
University of Chicago
Argonne National Laboratory
Introduction

• In distributed and shared systems:
  – The performance of resources changes dynamically
  – Variability in resource performance can have a major influence on application performance

• To deliver dependable and sustained performance to applications:
  – Performance monitoring and anomaly diagnosis are necessary
What is the problem?

• A system can be characterized by a set of system metrics:
  – M = (m1, m2, …, mn)
  – Example: (CPU load, bandwidth, free memory size, number of open files, …)

• Application performance can be described quantitatively by a performance metric Y
  – Example: number of computations finished per unit time

Goal: monitor the performance of the system components (the values of M) so that we can diagnose the cause when an anomaly appears in application performance (the value of Y).
Solution

• Challenges:
  – Computer systems and applications continue to increase in complexity and size
  – Interactions among components are poorly understood
  – Instrumentation produces tremendous volumes of data
    > Resulting in complexity for data analysis and anomaly diagnosis

• This requires a data reduction strategy that:
  – Reduces the number of system metrics that a monitoring system must manage (necessary)
  – Retains the interesting characteristics of the performance data (sufficient)
Outline

• Problems

• > Data Reduction Strategy
  – Two observations
  – Redundant system metrics reduction
  – Statistical Variable Selection

• Experiments

• Conclusion
Two Observations

• Some system metrics may capture the same or similar information
  – They are correlated with each other
  – Only one is necessary; the others are redundant

• Not all system metrics are related to a particular application's performance
  – Some system metrics are unrelated to the application's performance and are therefore unnecessary

These observations motivate a two-step data reduction strategy.
Redundant system metrics reduction

• Clustering-based method:
  – Use the correlation coefficient (r) to measure the degree of correlation between two system metrics (a sketch of this computation follows below)
  – Group metrics with a high correlation coefficient into clusters
  – Eliminate all but one of the metrics in each cluster

• Two questions:
  – What threshold value t to use (determined experimentally)
  – How to compare r against t
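To make the clustering step concrete, here is a minimal sketch (assuming NumPy; the data and threshold are illustrative, not the authors' code) of computing the pairwise correlation coefficients that the method starts from:

```python
import numpy as np

# Hypothetical monitoring data: rows are samples, columns are system
# metrics (e.g., CPU load, bandwidth, free memory size, ...).
samples = np.random.rand(100, 5)

# np.corrcoef treats each row as one variable, so transpose the matrix.
r = np.corrcoef(samples.T)           # r[i, j] = correlation of metrics i, j

t = 0.95                             # threshold value, chosen experimentally
pairs = [(i, j)
         for i in range(r.shape[0])
         for j in range(i + 1, r.shape[0])
         if abs(r[i, j]) > t]        # candidate pairs for clustering
```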
How to compare

• Traditional method: direct mathematical comparison
  – Is r > t?

• Problems:
  – Only a limited number of sample data points are available
  – r may change when computed from data collected during different runs

[Figure: sample correlation coefficient between the number of transfers issued per second and the number of memory pages cached per second, over 20 runs of the Cactus application]

• May eliminate uncorrelated metrics purely by chance.
Z-test

• Goal: reduce false errors given a limited number of sample data points, and avoid grouping uncorrelated metrics into one cluster

• Z-test (a sketch follows below):
  – A statistical method
  – Determines whether an observed correlation is statistically significantly larger than the threshold value (95% confidence level in this work)
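The slides do not give the test's formula, but a standard construction for testing whether a correlation exceeds a threshold is Fisher's z-transformation; a minimal sketch assuming SciPy:

```python
import math
from scipy.stats import norm

def correlation_exceeds(r, t, n, confidence=0.95):
    """One-sided test of H0: |rho| <= t vs. H1: |rho| > t, given a
    sample correlation r computed from n observations."""
    # Fisher's transformation makes the distribution of atanh(r)
    # approximately normal with standard error 1 / sqrt(n - 3).
    z = (math.atanh(abs(r)) - math.atanh(t)) * math.sqrt(n - 3)
    return z > norm.ppf(confidence)

# With only 20 runs, even r = 0.97 is not significantly above t = 0.95,
# which is exactly the chance-grouping risk the Z-test guards against.
print(correlation_exceeds(0.97, 0.95, n=20))   # False
```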
Redundant metrics reduction Alg.
Given a set of samples, we proceed as follows (a sketch of the full algorithm follows below):
– Perform the Z-test on the correlation coefficient between every pair of system metrics.
– Group two metrics into one cluster only when the absolute value of their correlation coefficient is statistically significantly larger than the threshold value.
– The result of this computation is a set of system metric clusters.
– The system metrics in each cluster are strongly correlated, so one metric from each cluster is used as its representative while the others are deleted as redundant.
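Putting the pieces together, a sketch of the whole reduction step, under the assumption that metrics whose pairwise correlation passes the Z-test are merged transitively into clusters via union-find (the slides do not specify how clusters are formed):

```python
import math
import numpy as np
from scipy.stats import norm

def passes_ztest(r, t, n, confidence=0.95):
    # Clamp |r| away from 1.0 so atanh stays finite for perfect correlations.
    r = min(abs(r), 0.999999)
    z = (math.atanh(r) - math.atanh(t)) * math.sqrt(n - 3)
    return z > norm.ppf(confidence)

def reduce_redundant(samples, t=0.95):
    """samples: (n_observations, n_metrics) array.
    Returns the index of one representative metric per cluster."""
    n, m = samples.shape
    corr = np.corrcoef(samples.T)
    parent = list(range(m))                 # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(m):
        for j in range(i + 1, m):
            if passes_ztest(corr[i, j], t, n):
                parent[find(i)] = find(j)   # merge the two clusters

    # Keep the lowest-numbered metric in each cluster; drop the rest.
    return sorted({find(i) for i in range(m)})
```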
Outline

• Problems

• Data Reduction Strategy
  – Two observations
  – Redundant system metrics reduction
  – > Statistical Variable Selection

• Experiments

• Conclusion
Statistical Variable Selection

• Some of these system metrics may not be related to our chosen performance metric

• Goal: identify the subset of all system metrics that is necessary to capture the performance metric

• This form of data reduction is also known as variable selection

• We use the Backward Elimination (BE) stepwise regression method to select the system metrics
BE stepwise regression method

• System metrics of concern: X = (x1, x2, …, xn)

• The application performance metric: y

• Steps (a sketch follows below):
  1. Fit the model y = β0 + β1x1 + β2x2 + … + βnxn
  2. Which xi is the most useless in this model?
     – Calculate the F value of each xi
     – The F value of each xi captures its contribution to the model
  3. Is the smallest F value below the predefined significance value? If yes, delete the corresponding xi and go to step 1.
  4. All remaining metrics are useful for capturing the variation of y.
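A minimal sketch of Backward Elimination (assuming statsmodels; for a single coefficient the partial F value equals the square of its t statistic, so the test can equivalently be driven by p-values):

```python
import numpy as np
import statsmodels.api as sm

def backward_eliminate(X, y, alpha=0.05):
    """X: (n_samples, n_metrics) array of system metrics; y: the
    performance metric. Returns indices of the surviving metrics."""
    selected = list(range(X.shape[1]))
    while selected:
        model = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
        pvalues = model.pvalues[1:]          # skip the intercept beta_0
        worst = int(np.argmax(pvalues))      # the least useful x_i
        if pvalues[worst] < alpha:
            break                            # every remaining metric matters
        del selected[worst]                  # delete x_i and refit (step 1)
    return selected
```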
Outline

• Problems

• Data Reduction Strategy

• > Experiments
  – Application and data collection
  – Two criteria
  – Experiment methodology
  – Results

• Conclusion
Application and Data Collection

• Application: Cactus

• Testbed: six Linux machines at UCSD

• Data collected at 0.033 Hz (about one sample every 30 seconds) for 24 hours

• Each data point includes 600+ system metric values and one application performance value

• System metrics are collected on each machine using three utilities:
  – (1) The sar command of the SYSSTAT tool set
  – (2) Network Weather Service (NWS) sensors
  – (3) The Unix command ping
Two criteria

• Reduction degree (RD) -- necessary
  – The total percentage of system metrics eliminated

• Coefficient of determination (R²) -- sufficient
  – A statistical measurement (a sketch of its computation follows below)
  – Indicates the fraction of the total variability in application performance that can be explained by the selected system metrics
  – A larger R² value means the selected system metrics better capture the variation in application performance
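For reference, a minimal sketch of how R² is computed from a model's predictions (the standard definition, not code specific to this work):

```python
import numpy as np

def r_squared(y, y_pred):
    ss_res = np.sum((y - y_pred) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot             # fraction of variability explained
```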
Experiment methodology
The 24-hour data set is partitioned into 12 equal-sized chunks.

• The first chunk is used as training data; the remaining 11 chunks are used as verification data.

Two-step experiment:

• Data reduction
  – Use the training data to select system metrics.

• Verification
  – Are the selected system metrics sufficient?
  – Is the result stable?
  – How does this method compare with other strategies?
    > RAND randomly picks a subset of system metrics equal in number to those selected by our strategy
    > MAIN uses a subset of 75 system metrics commonly used in other work to model application performance
Data Reduction using training data

• As the threshold value increases, RD decreases, since fewer system metrics are grouped into clusters and thus removed as redundant

• As the threshold value increases, R² increases, since more information is available to model the application performance

• With a threshold value of 0.95: RD = 0.78 and R² = 0.98

• A total of 141 of the original 628 system metrics were selected
System metrics selected on one machine
Name         Measurement
wtps         Total number of write requests per second issued to the physical disk
activepg     Number of active (recently touched) pages in memory
proc/s       Total number of processes created per second
rxpck/s      Total number of packets received per second
txpck/s      Total number of packets transmitted per second
coll/s       Number of collisions that happened per second while transmitting packets
kbbuffers    Amount of memory used as buffers by the kernel, in kilobytes
ip-frag      Number of IP fragments currently in use
runq-sz      Run queue length (number of processes waiting for run time)
ldavg-5      System load average for the past 5 minutes
ldavg-15     System load average for the past 15 minutes
campg/s      Number of additional memory pages cached by the system per second
dentunusd    Number of unused cache entries in the directory cache
file-sz      Number of used file handles
Rtsig-sz     Number of queued RT signals
cswch/s      Number of context switches per second
Latency      Amount of time required to transmit a TCP message to a target machine
bandwidth    Speed with which data can be sent to a target machine per second
AvailCPU     Fraction of CPU available to a newly started process
FreeMem      Amount of unused space in memory
Verification

• R² values of SDR (our Statistical Data Reduction strategy), MAIN, and RAND:

• SDR exhibited an average R² value of 0.907
  – 55.0% and 98.5% higher than those of RAND and MAIN, respectively

• The system metrics selected by SDR are significantly more efficient than the alternatives at capturing Cactus performance
Verification Results Analysis

• The system metrics selected by our strategy are:
  – Sufficient to capture the variation in application performance (average R² value of 0.907)
  – Stable (high R² values over a long period: 24 hours)
  – Better than the two alternative strategies considered
Conclusion

• Statistical data reduction strategy:
  – Reduces redundant system metrics that convey the same information
    > Clustering-based method + Z-test
  – Removes unnecessary system metrics that are unrelated to application performance
    > BE stepwise regression method

• Identifies system metrics that are:
  – Necessary (high reduction degree)
  – Sufficient to capture application behavior (higher R² value than the other strategies)
Contact

• Lingyun Yang: [email protected]
• Jennifer M. Schopf: [email protected]
• Catalin L. Dumitrescu: [email protected]
• Ian Foster: [email protected]