Network Degree Distribution Inference Under Sampling
Aleksandrina Goeva (1), Eric Kolaczyk (1), Rich Lehoucq (2)
(1) Department of Mathematics and Statistics, Boston University
(2) Sandia National Labs
August 18, 2016
BU/Keio, Boston, MA
A. Goeva, E. Kolaczyk, R. Lehoucq
Network Degree Distribution Inference Under Sampling
Motivation
Sampling introduces randomness in the sampled network.
Sampled network characteristics may not represent those of the true
network well.
Sampling Mechanisms - Examples
[Figure: examples of sampling mechanisms; source: Kolaczyk (2009)]
Setup
Problem introduced by Frank (1968)
$\mathbb{E}[N^*] = PN$
$N = (N_0, N_1, \ldots, N_M)$ is the degree counts vector of the true network,
$N^* = (N^*_0, N^*_1, \ldots, N^*_M)$ is the degree counts vector of the sampled network,
P is a linear operator that depends fully on the sampling scheme and
not on the network itself, and
M is the maximum degree in the true network G
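As an illustration of an operator that depends only on the design, the sketch below builds P for one specific design. The design is an assumption here, not something the slide fixes: induced subgraph sampling in which each vertex is kept independently with probability p. A vertex of true degree j is then kept with probability p and retains each of its j neighbours with probability p, which gives P[i][j] = C(j, i) p^(i+1) (1-p)^(j-i). The names `induced_subgraph_operator` and `apply_operator`, and the counts in `N_true`, are hypothetical.

```python
from math import comb

def induced_subgraph_operator(max_degree, p):
    """Return P as a list of rows, P[i][j] = comb(j, i) * p**(i+1) * (1-p)**(j-i).

    Assumed design (not fixed by the slides): each vertex is kept independently
    with probability p and only edges between kept vertices are observed.
    """
    size = max_degree + 1
    P = [[0.0] * size for _ in range(size)]
    for j in range(size):          # true degree
        for i in range(j + 1):     # observed degree cannot exceed true degree
            P[i][j] = comb(j, i) * p ** (i + 1) * (1 - p) ** (j - i)
    return P

def apply_operator(P, counts):
    """Matrix-vector product P @ counts using plain lists."""
    return [sum(P_ij * n_j for P_ij, n_j in zip(row, counts)) for row in P]

# Hypothetical true degree counts N = (N_0, ..., N_3).
N_true = [10, 20, 15, 5]
P = induced_subgraph_operator(max_degree=3, p=0.5)
expected_sampled = apply_operator(P, N_true)   # E[N*] = P N under the assumed design
```

Under this assumed design each column of P sums to p, so the expected number of sampled vertices is p times the true vertex count, and at p = 1 the operator reduces to the identity.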
Naive Solution Issues
$\hat{N}_{\text{naive}} = P^{-1} N^*$
P is typically non-invertible.
Solutions may not be non-negative.
See Zhang, Kolaczyk, Spencer (2015).
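The negativity issue can be seen in a small worked case. The sketch below assumes induced subgraph sampling with independent vertex inclusion probability p, for which P is upper triangular with diagonal entries p^(i+1). Under that assumption P is formally invertible, but the geometrically shrinking diagonal makes inversion amplify the observed counts. The observed vector `N_star` and both function names are hypothetical.

```python
from math import comb

def induced_subgraph_operator(max_degree, p):
    """P[i][j] = comb(j, i) * p**(i+1) * (1-p)**(j-i) (assumed sampling design)."""
    size = max_degree + 1
    P = [[0.0] * size for _ in range(size)]
    for j in range(size):
        for i in range(j + 1):
            P[i][j] = comb(j, i) * p ** (i + 1) * (1 - p) ** (j - i)
    return P

def solve_upper_triangular(U, b):
    """Back substitution for U x = b with U upper triangular."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - tail) / U[i][i]
    return x

p = 0.5
P = induced_subgraph_operator(max_degree=3, p=p)
# Hypothetical observation: a single sampled vertex of degree 3 and nothing else.
N_star = [0, 0, 0, 1]
N_naive = solve_upper_triangular(P, N_star)
# N_naive == [-2.0, 12.0, -24.0, 16.0]: two of the "counts" are negative.
```

Even with exact arithmetic, the naive estimate alternates in sign here, which is one motivation for imposing non-negativity as a constraint.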
Problem Formulation
$N^* = PN + E$
Ill-posed linear inverse problem.
P is not random, depends only on the sampling design.
E is the noise due to sampling.
$\mathbb{E}[E] = 0$
$\mathbb{E}[E E^T] = C$
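For a very small graph the two moment conditions can be checked exactly rather than by simulation. The sketch below assumes induced subgraph sampling with independent vertex inclusion probability p, enumerates all vertex subsets of a hypothetical 4-vertex graph, and computes the exact mean and covariance of N*. The graph, the value of p, and the function names are illustrative assumptions.

```python
from itertools import product
from math import comb

def induced_subgraph_operator(max_degree, p):
    """P[i][j] = comb(j, i) * p**(i+1) * (1-p)**(j-i) (assumed sampling design)."""
    size = max_degree + 1
    P = [[0.0] * size for _ in range(size)]
    for j in range(size):
        for i in range(j + 1):
            P[i][j] = comb(j, i) * p ** (i + 1) * (1 - p) ** (j - i)
    return P

def degree_counts(adjacency, kept, max_degree):
    """Degree counts of the subgraph induced by the kept vertices."""
    counts = [0] * (max_degree + 1)
    for v in kept:
        degree = sum(1 for u in adjacency[v] if u in kept)
        counts[degree] += 1
    return counts

# Hypothetical graph: triangle 0-1-2 with a pendant vertex 3 attached to 2.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
max_degree, p = 3, 0.3
size = max_degree + 1
vertices = sorted(adjacency)
N_true = degree_counts(adjacency, set(vertices), max_degree)   # [0, 1, 2, 1]

# Exact first and second moments of N* over all 2^4 vertex subsets.
mean = [0.0] * size
second_moment = [[0.0] * size for _ in range(size)]
for mask in product([0, 1], repeat=len(vertices)):
    kept = {v for v, keep in zip(vertices, mask) if keep}
    prob = p ** len(kept) * (1 - p) ** (len(vertices) - len(kept))
    n_star = degree_counts(adjacency, kept, max_degree)
    for i in range(size):
        mean[i] += prob * n_star[i]
        for k in range(size):
            second_moment[i][k] += prob * n_star[i] * n_star[k]

P = induced_subgraph_operator(max_degree, p)
PN = [sum(P[i][j] * N_true[j] for j in range(size)) for i in range(size)]
noise_mean = [m - pn for m, pn in zip(mean, PN)]      # E[E]; zero up to rounding
C = [[second_moment[i][k] - mean[i] * mean[k] for k in range(size)]
     for i in range(size)]                             # Cov(N*) = E[E E^T]
```

Exhaustive enumeration is only feasible for toy graphs; it is used here purely to make the definitions of E and C concrete.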
Proposed Approach
Complexity Functional
$K(\tilde{N}, \cdot) = (P\tilde{N} - \cdot)^T C^{-1} (P\tilde{N} - \cdot) + \lambda \|D\tilde{N}\|_2^2$
where
$\lambda$ is a regularization parameter
$D$ is a second-order differencing operator
$C = \mathrm{Cov}(N^*) = \mathbb{E}[E E^T]$
Look for a constrained solution
$\tilde{N} \in \mathcal{C} := \{\tilde{N} : \tilde{N} \ge 0 \text{ and } 1^T \tilde{N} = n_v\}$
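To make the smoothness penalty concrete, the sketch below builds a second-order differencing operator using one common convention, rows of the form (1, -2, 1) with no special boundary handling. The slide does not state the exact form of D, so this convention and both function names are assumptions.

```python
def second_difference_operator(size):
    """Rows (1, -2, 1) on consecutive triples; shape (size - 2) x size."""
    D = [[0.0] * size for _ in range(size - 2)]
    for r in range(size - 2):
        D[r][r], D[r][r + 1], D[r][r + 2] = 1.0, -2.0, 1.0
    return D

def roughness_penalty(counts, lam):
    """lam * ||D counts||_2^2 for the second-order differencing operator D."""
    D = second_difference_operator(len(counts))
    Dx = [sum(d * c for d, c in zip(row, counts)) for row in D]
    return lam * sum(v * v for v in Dx)

smooth = [1, 2, 3, 4, 5]   # linear in the degree index: zero second differences
jagged = [1, 5, 1, 5, 1]   # oscillating counts: second differences -8, 8, -8

penalty_smooth = roughness_penalty(smooth, lam=1.0)   # 0.0
penalty_jagged = roughness_penalty(jagged, lam=1.0)   # 3 * 64 = 192.0
```

The penalty vanishes on sequences that are linear in the degree index and grows with oscillation, so larger values of the regularization parameter favour smoother estimated degree counts.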
$\mathcal{C}$-constrained Minimum Empirical Complexity Estimate
Constrained Penalized Weighted Least Squares
$\min_{\tilde{N}} (P\tilde{N} - N^*)^T C^{-1} (P\tilde{N} - N^*) + \lambda \|D\tilde{N}\|_2^2$
subject to $\tilde{N} \in \mathcal{C}$
$\mathcal{C}$-constrained minimum empirical complexity estimate:
$\hat{N} = \arg\min_{\tilde{N} \in \mathcal{C}} K(\tilde{N}, N^*)$
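A minimal sketch of how such a constrained problem could be solved numerically follows. It is not the authors' solver: projected gradient descent is swapped in as a dependency-free stand-in for a general quadratic programming routine. It further assumes a diagonal C (passed as `weights`, the diagonal of its inverse), induced subgraph sampling for P, and (1, -2, 1) rows for D. All function names and input values are hypothetical.

```python
from math import comb

def induced_subgraph_operator(max_degree, p):
    size = max_degree + 1
    return [[comb(j, i) * p ** (i + 1) * (1 - p) ** (j - i) if i <= j else 0.0
             for j in range(size)] for i in range(size)]

def second_difference_operator(size):
    return [[{0: 1.0, 1: -2.0, 2: 1.0}.get(c - r, 0.0) for c in range(size)]
            for r in range(size - 2)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def project_to_scaled_simplex(v, total):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = total}."""
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_k
        t = (cumulative - total) / k
        if u_k - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def objective(P, D, weights, lam, x, n_star):
    """K(x, n_star) with C^{-1} = diag(weights)."""
    r = [a - b for a, b in zip(matvec(P, x), n_star)]
    fit = sum(w * ri * ri for w, ri in zip(weights, r))
    return fit + lam * sum(v * v for v in matvec(D, x))

def constrained_pwls(P, D, weights, lam, n_star, n_v, iterations=5000):
    """Projected gradient descent over {x >= 0, sum(x) = n_v}."""
    size = len(P[0])
    Pt, Dt = transpose(P), transpose(D)
    # Crude step size: the trace of the Hessian 2 (P^T W P + lam D^T D)
    # is an upper bound on its largest eigenvalue.
    trace = 2 * (sum(w * a * a for w, row in zip(weights, P) for a in row)
                 + lam * sum(d * d for row in D for d in row))
    step = 1.0 / trace
    x = [n_v / size] * size
    for _ in range(iterations):
        r = [a - b for a, b in zip(matvec(P, x), n_star)]
        grad_fit = matvec(Pt, [w * ri for w, ri in zip(weights, r)])
        grad_rough = matvec(Dt, matvec(D, x))
        grad = [2 * g + 2 * lam * h for g, h in zip(grad_fit, grad_rough)]
        x = project_to_scaled_simplex(
            [xi - step * gi for xi, gi in zip(x, grad)], n_v)
    return x

P = induced_subgraph_operator(3, 0.5)
D = second_difference_operator(4)
n_star = [8, 10, 5, 1]            # hypothetical observed degree counts
weights = [1.0, 1.0, 1.0, 1.0]    # placeholder for the diagonal of C^{-1}
n_v, lam = 50, 0.1
start = [n_v / 4] * 4
N_hat = constrained_pwls(P, D, weights, lam, n_star, n_v)
```

Every iterate stays inside the constraint set by construction, and with this conservative step size the objective does not increase from one iteration to the next. Convergence can be slow when P is ill-conditioned, so a dedicated quadratic programming solver would usually be preferable in practice.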
Quality of the Solution
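We aim to upper-bound the risk:
$\mathbb{E}[\|P\hat{N} - PN\|_{C^{-1}}^2]$
$= \mathbb{E}[(P\hat{N} - PN)^T C^{-1} (P\hat{N} - PN)]$
$\le \mathbb{E}[K(\hat{N}, PN)]$
$\le K(N^0, PN) + 2\,\mathbb{E}[\langle C^{-1/2}E,\; C^{-1/2}(P\hat{N} - PN^0)\rangle]$
where
$N^0 = \arg\min_{\tilde{N} \in \mathcal{C}} K(\tilde{N}, PN)$
is the $\mathcal{C}$-constrained minimizer of theoretical complexity.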
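One way to see where the last inequality in the risk bound comes from is sketched below. It assumes that C is symmetric positive definite, so that $C^{-1/2}$ exists, and it uses only the definitions of $K$, $\hat{N}$, $N^0$ and the model $N^* = PN + E$.

```latex
\begin{align*}
K(\hat{N}, N^*) &\le K(N^0, N^*)
  && \text{($\hat{N}$ minimizes $K(\cdot, N^*)$ over $\mathcal{C}$ and $N^0 \in \mathcal{C}$)} \\
\|P\tilde{N} - N^*\|_{C^{-1}}^2
  &= \|P\tilde{N} - PN\|_{C^{-1}}^2
   - 2\langle C^{-1/2}E,\; C^{-1/2}(P\tilde{N} - PN)\rangle
   + \|E\|_{C^{-1}}^2
  && \text{(substitute $N^* = PN + E$)} \\
K(\hat{N}, PN) &\le K(N^0, PN)
   + 2\langle C^{-1/2}E,\; C^{-1/2}(P\hat{N} - PN^0)\rangle
  && \text{(expand both sides, cancel $\|E\|_{C^{-1}}^2$)}
\end{align*}
```

Taking expectations gives the stated bound, since $K(N^0, PN)$ is deterministic. The first inequality in the chain holds because the penalty $\lambda \|D\hat{N}\|_2^2$ is non-negative.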
First Term
$K(N^0, PN)$
This is the minimum theoretical complexity.
This term is not random.
Bounded in terms of a functional of the sampling design.
Different Regimes
Underlying all sampling mechanisms there is a fundamental quantity p
controlling the rate of sampling.
The problem behaves differently depending on the value of p.
We identify three regimes:
p = 1: full information - trivial case, no noise, P is diagonal.
small p: the distribution of E is approximately Poisson.
moderate p: the distribution of E is approximately Normal.
Different Regimes - Small p
Small p ≈ 10% to 20%: the distribution of E is approximately Poisson.
[Figure: Q-Q plot of sample quantiles against Poisson theoretical quantiles]
Different Regimes - Moderate p
Moderate p ≈ 30% to 60%: the distribution of E is approximately Normal.
[Figure: Normal Q-Q plot of sample quantiles against theoretical quantiles]
Second Term
$\mathbb{E}[\langle C^{-1/2}E,\; C^{-1/2}(P\hat{N} - PN^0)\rangle] \le \mathbb{E}\left[\sup_{P\hat{N} - PN^0 \in \text{set}} \langle C^{-1/2}E,\; C^{-1/2}(P\hat{N} - PN^0)\rangle\right]$
Under the moderate p regime, the distribution of E is reasonably
close to Gaussian.
Assuming the entries of $C^{-1/2}E$ are independent standard Gaussian,
we can bound this term using Gaussian widths.
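For reference, the Gaussian width of a set T is w(T) = E sup over t in T of <g, t>, where g has independent standard Gaussian entries. The sketch below estimates this quantity by Monte Carlo for a finite set. The set used here, the signed coordinate vectors, is purely a hypothetical stand-in, since the slide does not specify the set over which the supremum is taken.

```python
import random
from math import log, sqrt

def gaussian_width_estimate(points, draws=2000, seed=0):
    """Monte Carlo estimate of w(T) = E sup_{t in T} <g, t>, with g ~ N(0, I)."""
    rng = random.Random(seed)
    dim = len(points[0])
    total = 0.0
    for _ in range(draws):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += max(sum(gi * ti for gi, ti in zip(g, t)) for t in points)
    return total / draws

# Hypothetical finite set T: the 2d signed coordinate vectors in R^d.
d = 5
T = [[sign * (1.0 if k == i else 0.0) for k in range(d)]
     for i in range(d) for sign in (1.0, -1.0)]
width = gaussian_width_estimate(T)
# Standard bound for the expected maximum over a finite set of unit vectors.
upper_bound = sqrt(2 * log(len(T)))
```

For this toy set the width equals the expected largest absolute entry of g, and the estimate falls below the classical sqrt(2 log |T|) bound.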
Summary
Motivation:
Problem arises in the context of sampled networks.
Under many sampling designs the expectation of the sampled degree
distribution is the product of a design-dependent matrix and the true
underlying degree distribution.
Main Idea:
Unusual ill-conditioned linear inverse problem.
The empirical analysis of Zhang et al. (2015) of the constrained
penalized weighted least squares solution is the first non-parametric
approach to the problem since it was proposed approximately 35 years ago.
To our knowledge, our work is the first attempt to produce theoretical
guarantees on the performance of the proposed solution.
Thank you!