
NetQuest: A Flexible Framework
for Internet Measurement
Lili Qiu
Joint work with
Mike Dahlin, Harrick Vin, and Yin Zhang
UT Austin
Motivation
[Figure: servers reached across many ISP networks (Sprint, C&W, UUNet, Qwest, AOL, AT&T, SBC, Earthlink)]
Motivation (Cont.)
[Figure: the same multi-ISP topology; a user asks "Why is it so slow?"]
Motivation (Cont.)
Applications are performance-aware
– Server selection
– Fault diagnosis
– Traffic engineering
– Overlay networks
– Peer-to-peer applications
– …
Internet: large & decentralized
Network measurement is important to
– ISPs
– Enterprise and university networks
– Application and protocol designers
– End users
– …
Key Requirements
• Scalable: work for large networks (100–10,000 nodes)
• Flexible: accommodate different applications
  – Multi-user design
    • Multiple users are interested in different parts of the network or have different objective functions
  – Augmented design
    • Conduct additional experiments given existing observations, e.g., after measurement failures
  – Differentiated design
    • Different quantities have different importance, e.g., a subset of paths belongs to a major customer
Q: Which measurements should we conduct to estimate the quantities of interest?
What We Want
A function f(x) of link performance x
– We use a linear function f(x) = F·x in this talk

[Figure: example 7-node topology with links x1, …, x11]

Ex. 1: average link delay
  f(x) = (x1 + … + x11)/11

Ex. 2: end-to-end delays
  f(x) = [ 1 0 ... 0 ]  [ x1  ]
         [ 1 1 0 .. 0 ]  [ x2  ]
         [ :        : ]  [ :   ]
         [ 0 ... 0 1  ]  [ x11 ]

Applies to any additive metric, e.g., log(1 − loss rate)
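To make the linear view concrete, here is a small numpy sketch. The 4-link, 3-path topology and all numbers are invented for illustration:

```python
import numpy as np

# Hypothetical 4-link network with 3 end-to-end paths; row i of F marks
# the links that path i traverses.
F = np.array([[1.0, 0.0, 0.0, 0.0],    # path 1: link 1
              [1.0, 1.0, 0.0, 0.0],    # path 2: links 1, 2
              [0.0, 0.0, 1.0, 1.0]])   # path 3: links 3, 4
x = np.array([10.0, 5.0, 8.0, 2.0])    # per-link delays (ms)

path_delays = F @ x                     # Ex. 2: f(x) = F x, the e2e delays
avg_link_delay = np.full(4, 1 / 4) @ x  # Ex. 1: F is a single row of 1/n

print(path_delays)     # [10. 15. 10.]
print(avg_link_delay)  # 6.25
```

Any additive metric works the same way: replace the delay entries of x with, e.g., log(1 − loss rate) per link.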
Problem Formulation
What we can measure: e2e performance
Network inference
– Given e2e performance, infer link performance
– Infer x from y = F·x, given y and F
Design of measurement experiments
– State of the art
  • Probe every path (e.g., RON)
  • Rank-based approach [sigcomm04]
– Select a "best" subset of paths to probe so that we can accurately infer f(x)
– How to quantify the goodness of a subset of paths?
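As a minimal stand-in for the inference step (the talk's Bayesian machinery comes later), one can solve the linear system in the least-squares sense. The toy topology is invented:

```python
import numpy as np

# Hypothetical: 4 links, but we probe only 3 paths (the rows of A are
# the measured subset of the routing matrix).
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
x_true = np.array([3.0, 1.0, 4.0, 2.0])  # unknown link delays
y = A @ x_true                            # observed e2e delays

# Underdetermined (3 equations, 4 unknowns): take the minimum-norm
# least-squares estimate as a simple placeholder for Bayesian inference.
x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(A @ x_hat, y))  # True: measured paths are reproduced
```

Even when x itself is not identifiable from the probed paths, functions f(x) that lie in the measured directions can still be estimated accurately, which is what the design question exploits.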
Bayesian Experimental Design
• Notation
  – D: a measurement design (e.g., a subset of paths to probe)
  – I: an inference algorithm
  – U(D, I): utility function for design D and inference I
• A good design maximizes the expected utility under the optimal inference algorithm
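One simple way to search for such a design is greedy selection: repeatedly add the candidate path that most improves a utility function. This is only a sketch of sequential design, not the talk's exact procedure; the example utility is an A-optimality-style score with an identity prior, and the candidate paths are invented.

```python
import numpy as np

def greedy_design(candidates, k, utility):
    # Greedily add the path whose inclusion maximizes the utility of the
    # design chosen so far.
    chosen, remaining = [], list(range(len(candidates)))
    for _ in range(k):
        best = max(remaining,
                   key=lambda i: utility(candidates[chosen + [i]]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Example utility: -tr{(A^T A + I)^-1}, i.e. smaller expected posterior
# variance = higher utility (an A-optimality-style score, identity prior).
def neg_a_score(A, n_links=4):
    return -np.trace(np.linalg.inv(A.T @ A + np.eye(n_links)))

paths = np.array([[1, 1, 0, 0],    # candidate paths over 4 links
                  [0, 1, 1, 0],
                  [0, 0, 1, 1],
                  [1, 0, 0, 1]], dtype=float)
print(greedy_design(paths, 2, neg_a_score))
```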
Design Criteria
• Let D(ξ) = (A_ξᵀ A_ξ + R)⁻¹, where σ²R⁻¹ is the covariance matrix of x
• Bayesian A-optimality
  – Goal: minimize the expected squared error E‖Fx̂ − Fx‖²
  – φ_A(ξ) = tr{F D(ξ) Fᵀ}
• Bayesian G*-optimality
  – Goal: minimize the worst-case squared error
  – φ_G*(ξ) = max diag{F D(ξ) Fᵀ}
• Bayesian D-optimality
  – Goal: maximize the expected gain in Shannon information
  – φ_D(ξ) = det{D(ξ)}
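The three criteria are cheap to evaluate for a candidate design. A small numpy sketch (toy matrices; A is the path matrix of the design, F the target function, R the prior precision term from the slide):

```python
import numpy as np

def design_criteria(A, F, R):
    """Bayesian design criteria: D = (A^T A + R)^{-1}, where
    sigma^2 R^{-1} is the prior covariance of x."""
    D = np.linalg.inv(A.T @ A + R)
    M = F @ D @ F.T
    phi_a = np.trace(M)            # A-optimality: expected squared error
    phi_gstar = np.diag(M).max()   # G*-optimality: worst-case squared error
    phi_d = np.linalg.det(D)       # D-optimality: minimize to maximize info gain
    return phi_a, phi_gstar, phi_d

# Adding a probe never hurts: all three criteria shrink (or stay equal).
F, R = np.eye(3), np.eye(3)
A1 = np.array([[1.0, 1.0, 0.0]])
A2 = np.vstack([A1, [0.0, 1.0, 1.0]])
a1, g1, d1 = design_criteria(A1, F, R)
a2, g2, d2 = design_criteria(A2, F, R)
print(a2 < a1, g2 < g1, d2 < d1)  # True True True
```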
Flexibility
Multi-user design
– New design criterion: a linear combination of different users' design criteria
Augmented design
– Ensure that the newly selected paths, in conjunction with previously monitored paths, maximize the utility
Differentiated design
– Give higher weights to the important rows of matrix F
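A tiny sketch of the differentiated-design idea: scale the important rows of F so that errors on a major customer's paths dominate the A-criterion. The weights and the covariance matrix here are invented for illustration:

```python
import numpy as np

F = np.array([[1.0, 0.0, 0.0],      # the major customer's path
              [0.0, 1.0, 1.0]])     # an ordinary path
w = np.array([10.0, 1.0])           # hypothetical importance weights
F_weighted = np.diag(w) @ F         # each row scaled by its weight

D = np.linalg.inv(2.0 * np.eye(3))  # a stand-in posterior covariance D(xi)
phi_a = np.trace(F_weighted @ D @ F_weighted.T)
print(phi_a)  # 51.0 -- the customer's path contributes 50 of it
```

A design chosen under this weighted criterion will spend its probing budget on reducing error for the high-weight rows first.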
Evaluation Methodology
Data sets
– NLANR traces
  • RTT, loss, and traceroute measurements between pairs of 140 universities in Oct. 2004
– Resilient Overlay Network (RON)
  • RTT and loss among 12–15 hosts in March & May 2001
Accuracy metric
  normalized MAE = Σᵢ |inferᵢ − actualᵢ| / Σᵢ actualᵢ
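The accuracy metric is straightforward to compute; a small helper with toy numbers:

```python
import numpy as np

def normalized_mae(inferred, actual):
    # normalized MAE = sum_i |inferred_i - actual_i| / sum_i actual_i
    inferred = np.asarray(inferred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.abs(inferred - actual).sum() / actual.sum()

print(normalized_mae([9.0, 11.0], [10.0, 10.0]))  # 0.1
```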
Evaluation Results (Cont.)
Estimate network-wide average path delay (NLANR)
[Plot: normalized MAE (0–0.8) vs. # monitored paths (0–400) for A-opt, G*-opt, D-opt, rank-based, and all-pairwise probing]
Evaluation Results (Cont.)
Estimate all paths' delay (NLANR)
[Plot: normalized MAE (0–0.8) vs. # monitored paths (0–400) for A-opt, G*-opt, D-opt, rank-based, and all-pairwise probing]
Summary of Other Results
• Bayesian experimental design can support
– Multi-user design
– Augmented design
– Differentiated design
• Inference accuracy also depends on
– Inference algorithms
– Prior information
Summary
Our contributions
– Bring Bayesian experimental design to network
measurement
– Develop a flexible framework to accommodate
different design requirements
– Experimentally show its effectiveness
On-going work
– Build a toolkit
– Gain operational experience
– Develop applications
• anomaly detection
• performance knowledge plane
Thank you!