Decentralized Hypothesis Testing on Graphs

Angelia Nedić
Collaborative work with Alexander Olshevsky and Cesar Uribe
Industrial and Enterprise Systems Engineering Department
and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign

Stochastic Networks Conference, San Diego, June 22, 2016
Centralized Case: Bayes' Rule Belongs to Stochastic Approximations
Bayesian Learning - Hypothesis Testing
• An agent observes a certain phenomenon and uses the observations to refine its
understanding of the state of the phenomenon.
• Let S be a finite set of possible states of the phenomenon.
• The agent receives "noisy" observations s_1, s_2, . . . of the true state at discrete
time instances k = 1, 2, . . .
• The observation sequence {s_k} is i.i.d., drawn according to an (unknown)
distribution f.
• The agent has a set of hypotheses {\ell_1(\cdot), . . . , \ell_m(\cdot)}, probability
distributions on S, and would like to determine the hypothesis that best describes the
data collected over time, i.e., determine an \ell_i that is closest to f in some sense.
Bayes’ Update: Inference Via Minimization Rule
• An initial distribution \mu_0 is selected at time k = 0.
• At time k, the agent has a belief \mu_k (a probability distribution on {1, 2, . . . , m})
that best explains the observations s_1, . . . , s_k collected up to that time. At time
k + 1, it observes s_{k+1} and updates its belief to \mu_{k+1}.

Bayesian update: for all i = 1, . . . , m,

    \mu_{k+1}^i = \frac{\mu_k^i \, \ell_i(s_{k+1})}{\sum_{p=1}^m \mu_k^p \, \ell_p(s_{k+1})}

• Bayes' rule minimizes a function composed of two terms (Walker 2006, Zellner 1988):
the Maximum Likelihood Estimation (MLE) of a state given the observed data, and a
regularization function, the Kullback-Leibler (KL) divergence from the current prior:

    \mu_{k+1} = \operatorname*{argmin}_{\pi \in \Delta_m} \big\{ \underbrace{-E_\pi[\ln \ell(s_{k+1})]}_{MLE} + D_{KL}(\pi \| \mu_k) \big\}

Here \Delta_m is the set of m-dimensional probability distributions,
\ell(s_{k+1}) = [\ell_1(s_{k+1}), . . . , \ell_m(s_{k+1})]', and
D_{KL}(p \| q) = \sum_{i=1}^m p_i \ln(p_i / q_i).
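As a quick numeric illustration (not from the talk), the Bayesian update above takes only a few lines; the two-hypothesis models and the observation sequence below are made up for the example.

```python
import numpy as np

def bayes_update(mu, likelihoods, s):
    """One Bayes step: mu[i] <- mu[i] * l_i(s) / normalizer.

    mu          : current belief over the m hypotheses (length-m array)
    likelihoods : m x |S| array; row i is the distribution l_i on S
    s           : index of the observed outcome s_{k+1}
    """
    post = mu * likelihoods[:, s]
    return post / post.sum()

# Two made-up hypotheses on a binary outcome space S = {0, 1}
ell = np.array([[0.5, 0.5],   # l_1
                [0.9, 0.1]])  # l_2
mu = np.array([0.5, 0.5])     # uniform prior
for s in [0, 1, 0, 0]:        # a short observation sequence
    mu = bayes_update(mu, ell, s)
print(mu)                      # belief after four observations
```

Each pass through the loop is exactly the displayed update: multiply the current belief by the likelihood of the new sample and renormalize over the simplex.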
Digression: Gradient-Projection/Mirror-Descent Method
• Minimize a convex function F : R^m → R over a closed convex set X ⊂ R^m.
• X is assumed simple (projection onto the set is easy).
• Euclidean-norm gradient projection:

    x_{k+1} = \Pi_X[x_k - \alpha_k \nabla F(x_k)]
    \iff
    x_{k+1} = \operatorname*{argmin}_{y \in X} \langle \nabla F(x_k), y \rangle + \frac{1}{2\alpha_k} \|y - x_k\|^2

where \Pi_X(y) is the Euclidean projection of y onto X and \alpha_k > 0 is a stepsize.
• Extending from the Euclidean norm to a Bregman function gives the mirror-descent
method:

    x_{k+1} = \operatorname*{argmin}_{y \in X} \langle \nabla F(x_k), y \rangle + \frac{1}{\alpha_k} B(y, x_k)

where B(y, x) is a Bregman distance function.
Mirror-Descent Method using KL-divergence
Fact: The KL-divergence is a Bregman distance function on \Delta_m, induced by t ↦ t ln t.
Hence, when X = \Delta_m, the mirror-descent method reads

    x_{k+1} = \operatorname*{argmin}_{y \in \Delta_m} \langle \nabla F(x_k), y \rangle + \frac{1}{\alpha_k} D_{KL}(y \| x_k)

Bayes' rule:

    \mu_{k+1} = \operatorname*{argmin}_{\pi \in \Delta_m} \{ -E_\pi[\ln \ell(s_{k+1})] + D_{KL}(\pi \| \mu_k) \}
              = \operatorname*{argmin}_{\pi \in \Delta_m} \{ -\langle \ln \ell(s_{k+1}), \pi \rangle + D_{KL}(\pi \| \mu_k) \}

where -\ln \ell(s_{k+1}) plays the role of a sample gradient.
Bayes' rule: a stochastic mirror-descent method using the KL-divergence (as the Bregman
distance function) and a fixed stepsize!
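This equivalence is easy to check numerically (an illustrative sketch with made-up numbers): the KL-prox step on the simplex has the closed form y_i ∝ x_i exp(−α g_i), and with stepsize α = 1 and sample gradient g = −ln ℓ(s_{k+1}) it reproduces the Bayes update exactly.

```python
import numpy as np

def md_kl_step(x, grad, alpha):
    """Entropic mirror-descent step on the simplex:
    argmin_y <grad, y> + (1/alpha) KL(y || x) has the closed form
    y_i proportional to x_i * exp(-alpha * grad_i)."""
    y = x * np.exp(-alpha * grad)
    return y / y.sum()

def bayes_step(mu, lik_at_s):
    post = mu * lik_at_s
    return post / post.sum()

mu = np.array([0.2, 0.3, 0.5])          # made-up prior
lik_at_s = np.array([0.4, 0.1, 0.25])   # [l_1(s), l_2(s), l_3(s)] at the observed s

# Sample gradient of the expected negative log-likelihood is -ln l(s);
# with stepsize alpha = 1 the two updates coincide.
md = md_kl_step(mu, -np.log(lik_at_s), alpha=1.0)
bayes = bayes_step(mu, lik_at_s)
print(np.allclose(md, bayes))  # True
```

The exponential in the prox step cancels the logarithm in the gradient, leaving exactly the multiplicative Bayes update.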
• What optimization problem is being solved by Bayes’ Update Rule?
Optimizing Expected Log-Likelihood

µk+1


= argmin −h ln `(sk+1 ) , πi + DKL(πkµk )
| {z }

π∈∆m 
sample gradient
The update rule is a stochastic mirror-descent method for solving LP
minimize
− hEf [ln `(s)], πi = −Ef [hln `(s), πi]
subject to
π ∈ ∆m .
Since f is unknown, we use samples of the gradients (expectation and gradient exchange)
The problem can be seen to be equivalent to
m
X
minimize
πi DKL (f k`i )
i=1
subject to
π ∈ ∆m .
Bayes’ update rule is a stochastic mirrordescent method for solving the above ”uncertain” LP problem, which is equivalent to
f
l |
l |
*
1
l |
n
min DKL (f k`i )
1≤i≤m
Consistency of the Bayes' Update

    minimize    \sum_{i=1}^m \pi_i D_{KL}(f \| \ell_i)
    subject to  \pi \in \Delta_m.

Equivalent to

    \min_{1 \le i \le m} D_{KL}(f \| \ell_i)

Let

    M^* = \big\{ i^* \;\big|\; D_{KL}(f \| \ell_{i^*}) = \min_{1 \le i \le m} D_{KL}(f \| \ell_i) \big\}.

Assumptions:
• \mu_0^j > 0 for at least one j ∈ M^* (e.g., choose the uniform distribution initially)
• supp(f) ⊆ supp(\ell_i) for all hypotheses i

Result 1: Under these assumptions,

    \lim_{k \to \infty} \mu_k^j = 0    almost surely for all j ∉ M^*.
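A minimal simulation of Result 1 (illustrative only; the distributions below are made up): the belief mass on every hypothesis outside M^* should vanish. Here the true distribution f coincides with the first model, so M^* = {1}.

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.array([0.7, 0.3])                # true (unknown) distribution on S = {0, 1}
ell = np.array([[0.7, 0.3],             # l_1 = f, so M* = {1} and D_KL(f||l_1) = 0
                [0.3, 0.7],
                [0.5, 0.5]])
mu = np.full(3, 1.0 / 3.0)              # uniform prior

for _ in range(2000):
    s = rng.choice(2, p=f)              # i.i.d. sample drawn from f
    mu = mu * ell[:, s]                 # Bayes step
    mu /= mu.sum()

print(mu.round(6))  # belief concentrates on hypothesis 1 (index 0)
```

The beliefs on the two wrong hypotheses decay roughly like exp(-k · KL gap), so after 2000 samples they are numerically negligible.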
Convergence Rate of Bayes' Update

Result: Under the preceding assumptions, for any given ρ > 0 there is an integer N(ρ)
such that, with probability 1 − ρ, for all k ≥ N(ρ),

    \mu_k^j \le \exp\Big( -k \frac{\gamma_2}{2} + \gamma_1 \Big)    for all j ∉ M^*,

where

    N(\rho) \triangleq \Big\lceil \frac{8 (\ln \alpha)^2 \ln(1/\rho)}{\gamma_2^2} \Big\rceil + 1,

α is a lower bound on the likelihoods \ell_i on the support of f:

    \alpha = \min_{i \in [m]} \; \min_{j \in \mathrm{supp}(\ell_i)} \ell_i(s_j),

    \gamma_1 \triangleq \max_{j \in [m] \setminus M^*, \; i^* \in M^*} \ln \frac{\mu_0^j}{\mu_0^{i^*}},

    \gamma_2 \triangleq \min_{j \in [m] \setminus M^*} \big( D_{KL}(f \| \ell_j) - D_{KL}(f \| \ell_{i^*}) \big).

The expression for γ1 suggests that the uniform prior is a good choice. γ2 captures how
well we can differentiate correct hypotheses from wrong ones; it affects both the rate
and N(ρ). The rate is exponential with high probability for large enough k.
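The rate constants are easy to evaluate on a toy problem; the sketch below (a made-up two-hypothesis example) computes α, γ2, and N(ρ) from the expressions above, with γ1 = 0 for the uniform prior.

```python
import math

def kl(p, q):
    """Discrete KL divergence D_KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

f = [0.5, 0.5]
ells = [[0.5, 0.5], [0.9, 0.1]]         # M* = {1}: D_KL(f || l_1) = 0
divs = [kl(f, ell) for ell in ells]
i_star = min(range(len(ells)), key=lambda i: divs[i])

# gamma_2: smallest KL gap between a wrong hypothesis and the best one
gamma2 = min(divs[j] - divs[i_star] for j in range(len(divs)) if j != i_star)

# alpha: lower bound on the likelihoods on the support of f
alpha = min(min(ell) for ell in ells)

rho = 0.1
N = math.ceil(8 * math.log(alpha) ** 2 * math.log(1 / rho) / gamma2 ** 2) + 1
print(gamma2, N)
```

Even for this easy pair of hypotheses (KL gap about 0.51), the high-probability bound only kicks in after a few hundred samples, which matches the "for large enough k" caveat.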
Bayes' Rule - Optimization
• The simplex method has finite convergence, but only for the deterministic problem.
• First-order stochastic methods (in simple form) achieve at best a sub-linear
convergence rate.
• In the Bayes' setting, we have a linear rate with high probability after some time.
• What other (stochastic) optimization problems share this property?
Decentralized Hypothesis Testing
Recently, Bayes' rule has been used for distributed hypothesis testing over graphs:
• Jadbabaie, Molavi, and Tahbaz-Salehi 2013, 2015
• Lalitha, Sarwate, and Javidi 2015
• Uribe, Olshevsky, AN 2015
• Shahrampour and Jadbabaie 2013; Shahrampour, Rakhlin, and Jadbabaie 2015: a
connection made between Bayes' rule and Nesterov's dual averaging method (the
optimizing "log-likelihood" view)

    \mu_{k+1} = \operatorname*{argmin}_{\pi \in \Delta_m} \Big\{ -\Big\langle \sum_{t=1}^{k+1} \ln \ell(s_t), \, \pi \Big\rangle + \frac{1}{\eta_k} D_{KL}(\pi \| \mu_0) \Big\}

(the sum collects the gradient samples and the KL term is the prox term), and their
distributed algorithm is based on this interpretation.
• Our work and that of Lalitha, Sarwate, and Javidi 2015 build distributed models based
on the interpretation of Bayes' rule as a stochastic mirror-descent step.
Extensions to Distributed Setting
• A system of n agents, each with its own collection of distributions \ell_i(\cdot \mid \theta)
over m hypotheses \Theta = \{\theta_1, . . . , \theta_m\}
• Each agent receives private observations s_k^i of the unknown state, with probability
distribution f_i
• Agents communicate over time-varying connected graphs G(t), t = 1, 2, . . .
• Learning objective: jointly determine a hypothesis \theta \in \Theta that best explains
the data of the whole agent system:

    \min_{\theta \in \Theta} \sum_{i=1}^n D_{KL}\big( f_i \| \ell_i(\cdot \mid \theta) \big)

• Let

    \Theta_i^* = \operatorname*{argmin}_{\theta \in \Theta} D_{KL}\big( f_i \| \ell_i(\cdot \mid \theta) \big)    and    \Theta^* = \bigcap_{i=1}^n \Theta_i^*

Consistency: All agents agree on the hypotheses in \Theta^* that best describe the
observations for all of them.
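A toy illustration (hypothetical likelihood models and observation distributions) of the objects \Theta_i^* and \Theta^* = \cap_i \Theta_i^*: each agent alone may be unable to distinguish several hypotheses, yet the intersection can be a singleton.

```python
import math

def kl(p, q):
    """Discrete KL divergence D_KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two agents, three hypotheses; made-up models l_i(.|theta) over S = {0, 1}
models = {
    1: {'A': [0.5, 0.5], 'B': [0.7, 0.3], 'C': [0.7, 0.3]},  # agent 1
    2: {'A': [0.4, 0.6], 'B': [0.4, 0.6], 'C': [0.7, 0.3]},  # agent 2
}
f = {1: [0.7, 0.3], 2: [0.4, 0.6]}  # each agent's observation distribution

theta_star = None
for i, lik in models.items():
    divs = {th: kl(f[i], ell) for th, ell in lik.items()}
    best = min(divs.values())
    Theta_i = {th for th, d in divs.items() if abs(d - best) < 1e-12}
    theta_star = Theta_i if theta_star is None else theta_star & Theta_i
print(theta_star)
```

Here agent 1 cannot tell B from C and agent 2 cannot tell A from B, but the only hypothesis optimal for both is B, so \Theta^* = {B}.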
Assumptions: Graphs, Likelihoods, and Initial Beliefs
Assumption (Graphs) The graph sequence {G_k} and the matrix sequence {A_k} are such that:
1. A_k is row-stochastic with [A_k]_{ij} > 0 if (j, i) ∈ E_k, and [A_k]_{ii} > 0.
2. If [A_k]_{ij} > 0, then [A_k]_{ij} > η for some positive scalar η.
3. {G_k} is B-strongly connected, i.e., there is an integer B ≥ 1 such that the graph
   \big( V, \bigcup_{i=kB}^{(k+1)B-1} E_i \big) is strongly connected for all k ≥ 0.

Assumption (Likelihood Models)

    \Theta^* \triangleq \bigcap_{i=1}^n \operatorname*{argmin}_{\theta \in \Theta} D_{KL}\big( f_i(\cdot) \| \ell_i(\cdot \mid \theta) \big)    is nonempty

Assumption (Initial Beliefs) For all agents i = 1, . . . , n:
1. The prior beliefs on all \theta^* \in \Theta^* are positive, i.e., \mu_0^i(\theta^*) > 0
for all \theta^* \in \Theta^*.
2. There exists an α > 0 such that \ell_i(s^i \mid \theta) > \alpha for all outcomes s^i
and all \theta \in \Theta.
Consistency
Proposition 1 Under these assumptions, the update rule

    \mu_{k+1}^i = \operatorname*{argmin}_{\pi \in \Pi(\Theta)} \Big\{ -E_\pi\big[ \ln \ell_i(s_{k+1}^i \mid \cdot) \big] + \sum_{j=1}^n [A_k]_{ij} D_{KL}(\pi \| \mu_k^j) \Big\}

or, explicitly,

    \mu_{k+1}^i(\theta) = \frac{\prod_{j=1}^n \mu_k^j(\theta)^{[A_k]_{ij}} \, \ell_i(s_{k+1}^i \mid \theta)}{\sum_{r=1}^m \prod_{j=1}^n \mu_k^j(\theta_r)^{[A_k]_{ij}} \, \ell_i(s_{k+1}^i \mid \theta_r)}

generates belief sequences \{\mu_k^i\}, i = 1, . . . , n, such that, with probability 1,
for all agents i:

    \lim_{k \to \infty} \mu_k^i(\theta) = 0    for all \theta \notin \Theta^*

Proof: Choose a \theta^* \in \Theta^* and define

    \varphi_k^i(\theta) = \ln \frac{\mu_k^i(\theta)}{\mu_k^i(\theta^*)}
From the update rule, we obtain, for all i = 1, . . . , n and all \theta \in \Theta,

    \varphi_{k+1}^i(\theta) = \sum_{j=1}^n [A_k]_{ij} \, \varphi_k^j(\theta) + \ln \frac{\ell_i(s_{k+1}^i \mid \theta)}{\ell_i(s_{k+1}^i \mid \theta^*)}

Stacking the \varphi_{k+1}^i, i = 1, . . . , n, in a vector:

    \varphi_{k+1}(\theta) = A_k \varphi_k(\theta) + \mathcal{L}_k(\theta)    for all \theta \in \Theta

After some manipulations akin to "consensus"-type analysis, we find that, almost surely
and coordinate-wise,

    \lim_{k \to \infty} \frac{1}{k} \varphi_{k+1}(\theta) \le -\frac{\delta}{n} \|H(\theta)\|_1 \mathbf{1}    for all \theta \notin \Theta^*

where

    [H(\theta)]_i = D_{KL}(f_i \| \ell_i(\cdot \mid \theta)) - D_{KL}(f_i \| \ell_i(\cdot \mid \theta^*))    for i = 1, . . . , n

and \delta is a bound on the "influence imbalance" for the chain \{A_k\}:
• \delta = \inf_{t \ge 0} \min_{i \in [n]} [\mathbf{1}_n' A_t \cdots A_0]_i
• when the matrices A_k are doubly stochastic, \delta = 1
Non-Asymptotic Learning Rate
Proposition 2 Let Assumptions 1-3 hold, let ρ ∈ (0, 1), and consider the update rule

    \mu_{k+1}^i = \operatorname*{argmin}_{\pi \in \Pi(\Theta)} \Big\{ -E_\pi\big[ \ln \ell_i(s_{k+1}^i \mid \cdot) \big] + \sum_{j=1}^n [A_k]_{ij} D_{KL}(\pi \| \mu_k^j) \Big\}

Then, for any \theta \notin \Theta^*, there is an integer N(ρ) such that, with probability
1 − ρ, for all k ≥ N(ρ),

    \mu_k^i(\theta) \le \exp\Big( -\frac{k}{2} \gamma_2 + \gamma_1 \Big)    for all i = 1, . . . , n,

where

    N(\rho) \triangleq \Big\lceil \frac{8 (\ln \alpha)^2 \ln(1/\rho)}{\gamma_2^2} \Big\rceil + 1,

α is a lower bound on the likelihoods \ell_i (see Assumption 3),

    \gamma_1 \triangleq \max_{\theta \in \Theta \setminus \Theta^*, \; \theta^* \in \Theta^*} \; \max_{1 \le i \le n} \Big( \ln \frac{\mu_0^i(\theta)}{\mu_0^i(\theta^*)} + \frac{C}{1 - \lambda} \|H(\theta)\|_1 \Big),

    \gamma_2 \triangleq \frac{\delta}{n} \min_{\theta \in \Theta \setminus \Theta^*} \|H(\theta)\|_1,

    [H(\theta)]_i = D_{KL}(f_i \| \ell_i(\cdot \mid \theta)) - D_{KL}(f_i \| \ell_i(\cdot \mid \theta^*))

The constants C, δ, and λ are related to the graphs and satisfy the following relations:
For general B-connected graph sequences \{G_k\}:

    C = 2,    \lambda \le \big( 1 - \eta^{nB} \big)^{1/B},    \delta \ge \eta^{nB}.

If every matrix A_k is doubly stochastic:

    C = \sqrt{2},    \lambda = \Big( 1 - \frac{\eta}{4n^2} \Big)^{1/B},    \delta = 1.

If each G_k is an undirected graph and each A_k is the lazy Metropolis matrix, i.e., the
stochastic matrix which satisfies

    [A_k]_{ij} = \frac{1}{2 \max(d(i), d(j))}    for all \{i, j\} \in G_k,

then

    C = \sqrt{2},    \lambda = 1 - \frac{1}{\Theta(n^2)},    \delta = 1.
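The lazy Metropolis weights are straightforward to construct; a small sketch (the helper below is not from the talk) assigns each edge the weight 1/(2 max(d_i, d_j)) and puts the remaining mass on the diagonal, which makes the matrix doubly stochastic.

```python
import numpy as np

def lazy_metropolis(edges, n):
    """Lazy Metropolis matrix for an undirected graph on n nodes:
    off-diagonal weight 1/(2*max(d_i, d_j)) per edge, remainder on the diagonal."""
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    A = np.zeros((n, n))
    for i, j in edges:
        w = 1.0 / (2 * max(deg[i], deg[j]))
        A[i, j] = A[j, i] = w
    np.fill_diagonal(A, 1.0 - A.sum(axis=1))  # make each row sum to 1
    return A

A = lazy_metropolis([(0, 1), (1, 2), (2, 3)], 4)  # path graph on 4 nodes
print(A.sum(axis=0), A.sum(axis=1))  # doubly stochastic: all ones
```

Because the off-diagonal mass in each row is at most 1/2, the diagonal ("lazy") entries are at least 1/2, which is what rules out oscillations in the consensus iteration.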
Allowing Conflicting Models
Relaxing the assumption that

    \bigcap_{i=1}^n \operatorname*{argmin}_{\theta \in \Theta} D_{KL}\big( f_i(\cdot) \| \ell_i(\cdot \mid \theta) \big)    is nonempty.

Note that

    \Theta^* = \operatorname*{argmin}_{\theta \in \Theta} \sum_{i=1}^n D_{KL}\big( f_i \| \ell_i(\cdot \mid \theta) \big)

is non-empty, and the agents' beliefs will vanish for all \theta \notin \Theta^* as long
as the matrices A_k satisfy some additional properties. Three different settings:
1. Time-varying undirected graphs: A_k is doubly stochastic with [A_k]_{ij} > 0 if
\{i, j\} \in E_k.
2. Time-varying directed graphs:

    [A_k]_{ij} = \begin{cases} 1/d_k^j & \text{if } j \in N_k^i \\ 0 & \text{otherwise} \end{cases}

where d_k^j is the out-degree of node j at time k and N_k^i is the set of in-neighbors
of node i.
3. Acceleration in static graphs:

    A_{ij} = \begin{cases} \frac{1}{\max\{d_i, d_j\}} & \text{if } \{i, j\} \in E, \\ 0 & \text{if } \{i, j\} \notin E, \end{cases}

where d_i is the degree of node i, and \bar{A} = \frac{1}{2} I + \frac{1}{2} A.
Learning Rules
Time-varying undirected graphs:

    \mu_{k+1}^i(\theta) = \frac{1}{Z_{k+1}^i} \prod_{j=1}^n \mu_k^j(\theta)^{[A_k]_{ij}} \, \ell_i(s_{k+1}^i \mid \theta),    (1)

where Z_{k+1}^i is a normalization factor, i.e.,

    Z_{k+1}^i = \sum_{p=1}^m \prod_{j=1}^n \mu_k^j(\theta_p)^{[A_k]_{ij}} \, \ell_i(s_{k+1}^i \mid \theta_p).

Time-varying directed graphs:

    \mu_{k+1}^i(\theta) = \frac{1}{\tilde{Z}_{k+1}^i} \Big( \prod_{j=1}^n \mu_k^j(\theta)^{[A_k]_{ij} y_k^j} \, \ell_i(s_{k+1}^i \mid \theta) \Big)^{1/y_{k+1}^i},    (2)

    y_{k+1}^i = \sum_{j=1}^n [A_k]_{ij} \, y_k^j.
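Update rule (1) is easy to simulate; the sketch below (a made-up 4-agent, 3-hypothesis problem on a fixed 4-cycle with lazy-Metropolis weights) mixes the neighbors' beliefs geometrically and then applies a local Bayes step. One agent's models are all observationally equivalent, yet its belief still concentrates through the network.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, S = 4, 3, 2                       # agents, hypotheses, outcomes

# Made-up models l^i(.|theta); theta_0 is the only hypothesis consistent
# with every agent's observations, so Theta* = {theta_0}.
ell = np.array([[[0.7, 0.3], [0.3, 0.7], [0.7, 0.3]],
                [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6]],
                [[0.8, 0.2], [0.8, 0.2], [0.8, 0.2]],   # uninformative agent
                [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]])
f = np.array([[0.7, 0.3], [0.6, 0.4], [0.8, 0.2], [0.5, 0.5]])

# Doubly stochastic mixing matrix on a 4-cycle (lazy Metropolis weights)
A = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

mu = np.full((n, m), 1.0 / m)           # uniform priors
for _ in range(1500):
    s = np.array([rng.choice(S, p=f[i]) for i in range(n)])
    # geometric averaging of neighbors' beliefs, then a local Bayes step
    log_mix = A @ np.log(mu)
    mu = np.exp(log_mix) * ell[np.arange(n), :, s]
    mu /= mu.sum(axis=1, keepdims=True)
print(mu.round(4))  # every row concentrates on theta_0
```

The log/exp pair implements the weighted geometric product in (1); the normalization plays the role of Z_{k+1}^i.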
Acceleration in static graphs, based on a recent paper by Olshevsky 2014:

    \mu_{k+1}^i(\theta) = \frac{1}{\hat{Z}_{k+1}^i} \cdot \frac{\prod_{j=1}^n \mu_k^j(\theta)^{(1+\sigma) A_{ij}} \, \ell_i(s_{k+1}^i \mid \theta)}{\prod_{j=1}^n \big( \mu_{k-1}^j(\theta) \, \ell_j(s_k^j \mid \theta) \big)^{\sigma A_{ij}}},    (3)

    \hat{Z}_{k+1}^i = \sum_{p=1}^m \frac{\prod_{j=1}^n \mu_k^j(\theta_p)^{(1+\sigma) A_{ij}} \, \ell_i(s_{k+1}^i \mid \theta_p)}{\prod_{j=1}^n \big( \mu_{k-1}^j(\theta_p) \, \ell_j(s_k^j \mid \theta_p) \big)^{\sigma A_{ij}}},    (4)

where \sigma = 1 - 2/(9U + 1) and U \ge n.
Acceleration in Static Graphs: Rate Result
Theorem 1 The accelerated update rule with the initial condition \mu_{-1}^i(\theta) =
\mu_0^i(\theta) has the following property: there is an integer N(ρ) such that, with
probability 1 − ρ, for all k ≥ N(ρ) and for all \theta_v \notin \Theta^*,

    \mu_k^i(\theta_v) \le \exp\Big( -\frac{k}{2} \gamma_2 + \gamma_1 \Big)    for all i = 1, . . . , n,

where

    N(\rho) \triangleq \Big\lceil \frac{72 (\ln \alpha)^2 n \ln(1/\rho)}{\gamma_2^2} \Big\rceil,

    \gamma_1 \triangleq \max_{\theta_v \notin \Theta^*} \; \max_{\theta_w \in \hat{\Theta}^*} \frac{\sqrt{2}}{1 - \lambda} \Big( \sum_{i=1}^n D_{KL}(f^i \| \ell^i(\cdot \mid \theta_v)) - \sum_{i=1}^n D_{KL}(f^i \| \ell^i(\cdot \mid \theta_w)) \Big),

    \gamma_2 = \frac{1}{n} \min_{\theta_v \notin \Theta^*} \Big( \sum_{i=1}^n D_{KL}(f^i \| \ell^i(\cdot \mid \theta_v)) - \sum_{i=1}^n D_{KL}(f^i \| \ell^i(\cdot \mid \theta^*)) \Big),

with α from the assumption on likelihoods and \lambda = 1 - \frac{1}{18U}.

Note: The index N(ρ) and the constant γ2 do not depend on the graph structure!
[Figure 1: three log-scale panels of the mean number of iterations vs. the number of
nodes, for (a) a path graph, (b) a circle graph, and (c) a grid graph.]

Figure 1: Empirical mean, over 50 Monte Carlo runs, of the number of iterations required
for \mu_k^i(\theta) < 0.01 for all agents and all \theta \notin \Theta^*. All agents but
one have all their hypotheses observationally equivalent. Dotted line: the algorithm
proposed in [Jadbabaie 2012]; dashed line: the basic algorithm; solid line: the
accelerated procedure.
Distributed Source Localization
[Figure 2: (a) Network of Agents — three agents, a source, and a 3 × 3 grid of
hypothesis locations \theta_1, . . . , \theta_9 in the plane; (b) Hypothesis
Distributions — likelihood functions and the observation distribution f^2 plotted
against distance.]

Figure 2: Figure (a) shows a group of 3 agents in a grid of 3 × 3 hypotheses. Each
hypothesis corresponds to a possible location of the source. For example, hypothesis
\theta_2 locates the source at the point (−10, 0) in the plane. Figure (b) shows the
likelihood functions for \theta_2 and \theta_5 and the distribution of observations f^2
for agent 2.
[Figure 3: (a) Network of Agents — normal, no-sensor, and conflicting agents, plus the
source, on a 20 × 20 grid; (b) Beliefs on the optimal hypothesis — \mu(\theta^*) vs.
time (log scale) for the update rules of Eq. (1), Eq. (2), [34], and [13].]

Figure 3: Figure (a) shows a network of heterogeneous agents. Triangles indicate agents
whose observations have been modified such that the optimal hypothesis is the point
(0, 0) in the grid. Squares indicate agents for whom all hypotheses are observationally
equivalent (i.e., no data is measured). Circles indicate regular agents with correct
observation models and informative hypotheses. Figure (b) shows the belief evolution on
the optimal hypothesis \theta^* for different belief update protocols.
Closely Related
• Shahrampour, Rakhlin, and Jadbabaie 2014
• Lalitha, Javidi, and Sarwate 2014, 2015
• Nedić, Olshevsky, and Uribe 2014, 2015
  (on arXiv: "Fast Convergence Rates for Distributed Non-Bayesian Learning")
Future Directions
• Time-varying parameter \theta^*
• Time-varying distributions f_t^i for agent observations
• Infinitely many hypotheses

Thank you