conditional estimation of exponential random graph models

Modelling large networks: conditional estimation of exponential random graph models
Pip Pattison
University of Melbourne
Methodology for Empirical Research on Social Interactions
Harvard University, November 13-14, 2009
Joint work with
University of Melbourne
Garry Robins
Peng Wang
Galina Daraganova
Oxford University
Tom Snijders
Johan Koskinen
The problem
Can we estimate exponential random graph models (ERGMs)
when:
•  we have a sample rather than a census of network ties
(specifically, a snowball sample)?
•  the network is large?
•  the network size is unknown?
•  we have various model specifications in mind?
This work is intended to complement Handcock and Gile (in press), who develop general likelihood inference for partial network data with known network size.
Exponential random graph models
ERGMs represent social networks as the outcome of known or proposed network tie formation processes, including:
Exogenous effects:
For example, shared characteristics, interests and affiliations, spatial
propinquity (e.g., McPherson, Smith-Lovin & Cook, 2001)
Endogenous network effects:
e.g., clustering, comparison and attachment processes
We focus here on the endogenous component of the model, but
note that the other part is very important substantively!
Exponential random graph models
$P(Y = y) = \frac{1}{\kappa(\theta)} \exp\left\{ \sum_p \theta_p z_p(y) \right\}$
•  Y = [Y(i,j)] is an n × n matrix of network tie variables, with realisations y = [y(i,j)]
•  Y(i,j) = 1 if i is tied to j, 0 otherwise
•  $z_p(y)$ is a network statistic
•  $\theta_p$ is a corresponding parameter
•  $\kappa(\theta) = \sum_y \exp\left\{ \sum_p \theta_p z_p(y) \right\}$ is a normalising quantity
Model parameters may be estimated using MCMCMLE
from a single observation of the network
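To make the role of the normalising quantity concrete, the following is a minimal sketch that enumerates every undirected graph on a handful of nodes and normalises the exponential weights directly. The choice of statistics (edge and triangle counts), the parameter values and the function names are illustrative assumptions, not the specification used in this talk. The sum runs over 2^(n(n-1)/2) graphs, which is why simulation-based estimation is needed for anything but tiny n.

```python
from itertools import combinations, product
from math import exp


def triangle_count(edges, n):
    """Number of triangles among the ties in `edges` (pairs (a, b) with a < b)."""
    present = set(edges)
    return sum(
        1
        for a, b, c in combinations(range(n), 3)
        if (a, b) in present and (a, c) in present and (b, c) in present
    )


def ergm_distribution(n, theta_edge, theta_triangle):
    """Return {graph: probability} over all undirected graphs on n nodes.

    Hypothetical two-statistic model: edge count and triangle count.
    """
    dyads = list(combinations(range(n), 2))
    weights = {}
    for pattern in product([0, 1], repeat=len(dyads)):
        edges = tuple(d for d, on in zip(dyads, pattern) if on)
        score = theta_edge * len(edges) + theta_triangle * triangle_count(edges, n)
        weights[edges] = exp(score)
    kappa = sum(weights.values())  # the normalising quantity
    return {graph: w / kappa for graph, w in weights.items()}


# Illustrative parameter values only.
dist = ergm_distribution(4, theta_edge=-1.0, theta_triangle=0.5)
```

With both parameters set to zero every graph is equally likely, which is a quick sanity check on the normalisation.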
Zones and neighbourhoods in a graph
For a subset A of nodes in network y:
Zk(A) = set of nodes for which the minimum distance to any node in A is k (the zone of order k of A in y)
Nk(A) = set of nodes within distance k of any node in A (the neighbourhood of order k)
Nk(A) = Nk-1(A) ∪ Zk(A)
[Figure: neighbourhoods of order 0, 1 and 2, for A comprising the marked vertices]
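The zone and neighbourhood definitions can be computed with a breadth-first search started from all nodes of A at once. This is a small sketch under the assumption that the network is undirected and stored as an adjacency dictionary; the function names and the toy graph are hypothetical.

```python
from collections import deque


def zones(adj, seeds, max_order):
    """Return [Z_0, Z_1, ..., Z_max_order] as sets, via multi-source BFS."""
    dist = {v: 0 for v in seeds}
    queue = deque(seeds)
    while queue:
        v = queue.popleft()
        if dist[v] == max_order:
            continue  # do not expand beyond the requested order
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    result = [set() for _ in range(max_order + 1)]
    for v, d in dist.items():
        result[d].add(v)
    return result


def neighbourhood(zone_list, k):
    """N_k(A) is the union of the zones Z_0(A), ..., Z_k(A)."""
    out = set()
    for zone in zone_list[: k + 1]:
        out |= zone
    return out


# Toy path-like graph: 3 - 1 - 2 - 4 - 5 - 6
adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4, 6], 6: [5]}
z = zones(adj, seeds={1}, max_order=2)
```

The same routine describes a k-wave snowball sample when A is the seed set: wave h recruits exactly the nodes in zone h.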
Markov models (Frank & Strauss, 1986)
Markov model:
Y(i,j) and Y(k,l) are conditionally independent unless {i,j} ∩ {k,l} ≠ ∅ (i.e. N0{i,j} ∩ N0{k,l} ≠ ∅)
Network statistics then correspond to:
m-stars (m = 1, …, n-1) and triangles
[Figure: m-star configuration (label: m nodes)]
Social circuit models (Snijders et al, 2006)
Social circuit (realisation-dependent) model:
Y(i,j) and Y(k,l) are conditionally independent unless N1{i,j} ⊇ {k,l} and N1{k,l} ⊇ {i,j}
[Figure: nodes i, j, k and l]
Network statistics are then subgraphs in which every pair of edges lies on a 4-cycle, including: m-2-paths, m-triangles and m-cliques
[Figure: m-2-path (m nodes), m-triangle (m nodes) and m-clique (m nodes in all)]
Other assumptions
We also often assume:
1.  Homogeneity: isomorphic configurations have equal
parameters (Frank & Strauss, 1986)
2.  Related effects: a relationship between parameters within a
family, and hence a single statistic for the families of:
‒  m-stars
‒  m-triangles,
‒  m-2-paths
(Snijders et al, 2006; Hunter & Handcock, 2006)
Large networks: an old problem
Moreno, Sociometry, 1937, p26*
Three core problems:
•  Size of networks
•  Network measurement
•  Network dynamics
Multi-wave snowball sampling designs
Initial wave 0: Z0
Wave 1: Z1
Wave 2: Z2
ykk: network on Zk
ykl: ties from Zk to Zl
[Figure: a snowball sample from a population of roughly 20,000 adults]
Link-tracing designs (Frank, 2005; Handcock & Gile, in press)
Model-based frameworks
•  Handcock and Gile (in press): a general formulation for
likelihood-based inference
But what if the network is really large or n is unknown?
We assume, for the moment, a social circuit model:
$P(Y = y) = \frac{1}{\kappa(\theta)} \exp\left\{ \sum_p \theta_p z_p(y) \right\}$
Conditional estimation (1-wave sample)
We use the fact that if i, j ∈ Z0, then Y(i,j) and Y(k,l) are conditionally independent unless k, l ∈ N1 = Z0 ∪ Z1
Define Y[1,1] to be the set of network tie variables on N1, and $y^0_{[1,1]}$ = y[1,1] but with ties in y00 set to 0
Then:
$\Pr(Y_{00} = y_{00} \mid \text{rest}) = \frac{1}{C} \exp\left( \sum_p \theta_p \left[ z_p(y_{[1,1]}) - z_p(y^0_{[1,1]}) \right] \right)$
And hence we can use observed data y[1,1] on N1 to obtain conditional MLEs for θ
c.f. Besag's (1974) coding scheme
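As a rough illustration of the exponent in the conditional probability, the sketch below computes the weighted difference between statistics on the observed ties on N1 and on the same ties with everything inside Z0 removed. It assumes simple edge and triangle counts in place of the alternating statistics of the social circuit model, and the function names and toy data are hypothetical. The constant C, which sums over all possible configurations of ties within Z0, is not computed here.

```python
from itertools import combinations


def statistics(edges):
    """(edge count, triangle count) for a set of frozenset({i, j}) ties."""
    nodes = sorted({v for e in edges for v in e})
    triangles = sum(
        1
        for a, b, c in combinations(nodes, 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edges
    )
    return (len(edges), triangles)


def conditional_exponent(theta, y11, z0):
    """Weighted sum of statistic differences: ties on N1 versus ties on N1
    with all ties inside Z0 set to 0."""
    y0 = {e for e in y11 if not e <= z0}  # drop ties with both ends in Z0
    full, reduced = statistics(y11), statistics(y0)
    return sum(t * (f - r) for t, f, r in zip(theta, full, reduced))


# Toy 1-wave sample: Z0 = {1, 2, 3}, Z1 = {4}.
y11 = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3), (3, 4)]}
value = conditional_exponent((-4.0, 1.0), y11, z0={1, 2, 3})
```

In this toy case the three ties inside Z0 contribute three edges and one triangle, so the exponent is 3 × (-4.0) + 1 × 1.0.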
For higher-wave snowball samples
Zone connectivity condition:
Each node in Zh+1 can be reached from some node in Zh
Also:
Any tie in Y[k,k], the set of tie variables on Nk, is conditionally
independent of any tie not between nodes in Nk+1
Hence we can:
Estimate θ using an MCMC in which we propose random changes to the entries in Y[k,k], conditioning on the observed values of Yk,k+1 and Yk+1,k+1 and ensuring that every proposed move in the MCMC satisfies the zone connectivity condition
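The constraint on proposals can be sketched as a simple check applied to each candidate tie toggle. This is an illustration only: it reads "can be reached from" as "has a tie to", the function names are hypothetical, and the Metropolis-Hastings acceptance step that a full sampler would apply to surviving proposals is not shown.

```python
def satisfies_zone_connectivity(edges, zone_list):
    """True if every node in zone h+1 has at least one tie to a node in zone h."""
    for h in range(len(zone_list) - 1):
        lower, upper = zone_list[h], zone_list[h + 1]
        for v in upper:
            if not any(frozenset((v, u)) in edges for u in lower):
                return False
    return True


def propose_toggle(edges, i, j, zone_list):
    """Toggle tie {i, j}; return the new tie set, or None if the move would
    break the zone connectivity condition."""
    proposal = edges ^ {frozenset((i, j))}
    if satisfies_zone_connectivity(proposal, zone_list):
        return proposal
    return None


# Toy two-wave sample: Z0 = {1}, Z1 = {2, 3}, Z2 = {4}.
zone_list = [{1}, {2, 3}, {4}]
edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (2, 4)]}
kept = propose_toggle(edges, 2, 3, zone_list)      # removing 2-3 is allowed
rejected = propose_toggle(edges, 1, 3, zone_list)  # node 3 would lose its link to Z0
```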
A simulation study:
two-wave snowball sampling
For the same fixed model:
  edge          -4.0
  alt-star       0.2
  alt-triangle   1.0
  alt-2-path    -0.2
Size of network: 500
Sizes of random seed sets: 20, 30, 40, 50, 60, 500
500 graphs sampled from the ERGM distribution
One snowball sample per graph
Complete networks
A graph from the distribution for n = 500: average degree ≈ 4.8
Distributions of estimates (n = 500)
Simulation results for n = 500

Effect        Seed set size   Mean     Median   IQR     Bias     RMSE    Coverage probability
Edge          20              -3.143   -3.677   4.203    0.857   3.640   0.741
Edge          30              -3.470   -3.808   3.217    0.530   2.594   0.876
Edge          40              -3.444   -3.747   3.090    0.556   2.440   0.900
Edge          50              -3.638   -3.921   2.709    0.362   1.901   0.938
Edge          60              -3.713   -3.973   2.707    0.287   1.824   0.942
Alt-star      20              -0.022    0.128   1.426   -0.222   1.127   0.752
Alt-star      30               0.065    0.147   1.009   -0.135   0.827   0.888
Alt-star      40               0.058    0.167   1.004   -0.142   0.774   0.906
Alt-star      50               0.124    0.201   0.844   -0.076   0.621   0.928
Alt-star      60               0.148    0.222   0.862   -0.052   0.591   0.940
Alt-triangle  20               0.983    0.977   0.188   -0.017   0.153   0.962
Alt-triangle  30               0.988    0.986   0.153   -0.012   0.119   0.952
Alt-triangle  40               0.986    0.989   0.139   -0.014   0.112   0.948
Alt-triangle  50               0.989    0.990   0.133   -0.011   0.102   0.942
Alt-triangle  60               0.986    0.993   0.122   -0.014   0.095   0.946
Alt-2-path    20              -0.202   -0.197   0.130   -0.002   0.093   0.875
Alt-2-path    30              -0.204   -0.205   0.097   -0.004   0.075   0.895
Alt-2-path    40              -0.203   -0.200   0.092   -0.003   0.065   0.903
Alt-2-path    50              -0.208   -0.207   0.077   -0.008   0.060   0.898
Alt-2-path    60              -0.209   -0.210   0.076   -0.009   0.054   0.901
Simulation results:
preliminary summary
Occasionally, conditional estimation is difficult, presumably
because of the particular sample from the particular graph – we
are evaluating the effect of both forms of sample variability
However:
•  Bias and RMSE decline as seed set size increases
•  For a sufficiently large seed set size, bias is small
•  Bias is very small for the alt-triangle and alt-2-path effects, less
so for edge, alt-stars
But also
•  As network size increases, so does variability of estimates at
given seed set size
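For readers who want to reproduce summaries of this kind from their own simulation output, here is a minimal sketch of the bias, RMSE and coverage calculations. Bias is taken as the mean estimate minus the true value, which agrees with the reported figures (for example, an edge mean of -3.143 against a true value of -4.0 gives 0.857). The coverage definition (nominal 95% Wald intervals, estimate ± 1.96 standard errors) is an assumption, as are the function name and the toy inputs.

```python
from math import sqrt
from statistics import mean, median


def summarise(estimates, std_errors, true_value, z=1.96):
    """Summary measures for repeated estimates of a single parameter."""
    errors = [e - true_value for e in estimates]
    covered = [abs(err) <= z * se for err, se in zip(errors, std_errors)]
    return {
        "mean": mean(estimates),
        "median": median(estimates),
        "bias": mean(errors),
        "rmse": sqrt(mean([err ** 2 for err in errors])),
        "coverage": sum(covered) / len(covered),
    }


# Toy inputs, not results from the study.
summary = summarise([-3.0, -5.0], [1.0, 0.4], true_value=-4.0)
```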
What if we ignore the sampling design?
MCMCMLEs from network on Z[2] = Z0 ∪ Z1 ∪ Z2
[Figure panels: edge, alt-2-path, alt-star, alt-triangle]
Example: Snowball sample (n=551) in Brimbank, Victoria:
pink = initial wave, blue = wave 1, green = wave 2, yellow =
nominees of wave 2
Spatial layout for network (individuals at
distances larger than 25 km excluded)
Conditional MLEs for 6 models based
on 2-wave sample (Daraganova, 2008)
model
Edge
alt-stars
alt-triangles
alt-2Paths
1
-4.86*
(0.20)
2
3.80*
(1.37)
3
-3.95*
(0.74)
-0.107
(0.9)
4
4.71*
(-0.11)
-0.11
(0.08)
5
ln(distance)
-1.05*
(0.17)
-6.52*
(0.918)
2.55*
(0.297)
-0.2*
(0.093)
6
-1.18
(2.49)
2.46*
(0.34)
-0.19*
(0.09)
7
-13.12*
(2.84)
1.92*
(0.82)
-0.24*
(0.33)
-0.24*
(0.09)
8
-7.21*
(2.79)
1.58
(0.85)
2.41*
(0.34)
-0.24*
(0.09)
-1.04*
(0.17)
-0.65*
(0.298)
-0.58
(0.31)
Other neighbourhood-based model specifications:
generalised dependence structures
Y(i,j) and Y(k,l) are conditionally independent unless:
Strict p-inclusion (p ≥ 1): full proximity
SIp : Np{i} ⊇ {k,l} and Np{j} ⊇ {k,l} and Np{k} ⊇ {i,j} and Np{l} ⊇ {i,j}
p-inclusion: symmetric proximity
Ip : Np{i,j} ⊇ {k,l} and Np{k,l} ⊇ {i,j}
Partial p-inclusion (p ≥ 1): asymmetric proximity
PIp : Np{i,j} ⊇ {k,l} or Np{k,l} ⊇ {i,j}
p-distance criterion: weak proximity
Dp : Np{i,j} ∩ {k,l} ≠ ∅ (equivalently, {i,j} ∩ Np{k,l} ≠ ∅)
[Figure legend: any walk of length ≤ p]
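The four criteria can be compared directly on a small graph. The sketch below is an illustration under stated assumptions: the network is undirected, Np(A) is the set of nodes within distance p of any node in A, and the function names and toy graphs are hypothetical. It evaluates the conditions literally as written, without any special treatment of the two tie variables being compared.

```python
from collections import deque


def neighbourhood(adj, sources, p):
    """Nodes within distance p of any node in `sources` (breadth-first search)."""
    dist = {v: 0 for v in sources}
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        if dist[v] < p:
            for w in adj.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
    return set(dist)


def dependence_criteria(adj, dyad_a, dyad_b, p):
    """Evaluate the four proximity criteria for two dyads."""
    a, b = set(dyad_a), set(dyad_b)

    def N(nodes):
        return neighbourhood(adj, nodes, p)

    return {
        "strict_inclusion": all(b <= N({v}) for v in a) and all(a <= N({v}) for v in b),
        "inclusion": b <= N(a) and a <= N(b),
        "partial_inclusion": b <= N(a) or a <= N(b),
        "distance": bool(N(a) & b),
    }


# A 4-cycle 1-2-3-4-1: dyads {1,2} and {3,4} satisfy inclusion but not strict inclusion.
cycle = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [3, 1]}
on_cycle = dependence_criteria(cycle, (1, 2), (3, 4), p=1)

# A path 1-2-3 with node 4 isolated: only the distance criterion holds.
path = {1: [2], 2: [1, 3], 3: [2], 4: []}
on_path = dependence_criteria(path, (1, 2), (3, 4), p=1)
```

The two toy cases show the nesting of the criteria: each one implies the next, so a pair of dyads that meets a stricter criterion also meets every weaker one.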
A hierarchy of models
SIp+1 : strict inclusion
Ip : inclusion
PIp : partial inclusion
Dp : distance
[Figure: hierarchy of dependence structures for p = 0, 1, 2, 3, ...]
The dependence hierarchy
[Figure: hierarchy diagram with the following nodes]
I0 = PI0 (Bernoulli)
SI1 (clique)
D0 (Markov)
I1 (social circuit)
PI1 (edge-triangle)
D1 (3-path)
SIp (p-club)
Ip (cyclic walk of length ≤ 2p+2)
PIp ((r+1)-path-(2(p-r)+1)-cyclic walk, 0 ≤ r ≤ p-1)
Dp (path of length ≤ p)
Labels: Cohesion, Closure, Brokerage, Connectivity
The dependence hierarchy and minimum requirements for conditional estimation
[Figure: the dependence hierarchy, colour-coded by the data required]
Key: requires
Green: Y00
Aqua: Y00 + Y01
Yellow: Y[1,1]
Red: Y[1,1] + Y12
In conclusion
•  It is worth thinking carefully about the forms of proximity underpinning dependence assumptions (but ultimately an empirical question as to what is required)
•  For neighbourhood-based dependence structures, we can isolate and (conditionally) model part of a network where multiple imputation may not be possible or feasible