Approximate Graph Matching

Approximate Graph Matching
R. Srikant (ECE/CSL, UIUC)
Coauthor: Joseph Lubars
Problem Statement
• Given two correlated graphs…
• one with known node identities,
• one with unknown (or incorrect) node identities…
• Goal: Infer the identities of the nodes in the second graph
Computational Complexity Requirement
We are interested in very large graphs, so the matching algorithm must have low computational complexity.
Problem Goes by Many Names
• Approximate Graph Matching
• Random Graph Isomorphism: Special case
• Network Deanonymization: Privacy
• Network Alignment: Biology
•…
Application 1: Social Networks
(Figure: an underlying friendship graph on Alice, Bob, and Carol, and two social networks sampled from it.)
Sample edges from an underlying friendship graph to obtain social networks.
(Figure: the same two social networks, with the node identities in one of them unknown.)
Use the graph topology of one social
network to deanonymize members
of another network
Application 2: Protein Interaction
(Figure: human and mouse protein-interaction networks, with proteins such as P62805 and P62806 matched across species.)
Find proteins with similar functions across
different species based on the topologies of
their interaction networks
Application 3: Wikipedia Articles
(Figure: corresponding articles in the English and French Wikipedias, e.g., Earth/Terre, Sun/Soleil, Solar System/Système solaire, Hydrosphere/Hydrosphère, Supercontinent/Supercontinent.)
Automatically find or correct corresponding
articles in different versions of Wikipedia based
on the graph of article links.
Mathematical Model
• Start with a random graph 𝐺 (generated according to some random graph model)
• Independently sample each edge of 𝐺 twice, each time with probability 𝑠, to obtain 𝐺1 and 𝐺2
• Permute the node labels of one graph (this unknown permutation is 𝜋)
• Note: permuting the node identities, giving them different identities, or erasing the node identities are all equivalent
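To make this concrete, here is a minimal Python sketch of the sampling model, assuming an Erdős–Rényi underlying graph; the function name and the edge-set representation are illustrative, not from the talk.

```python
import random

def sample_correlated_pair(n, p, s, seed=0):
    """Generate an underlying G(n, p) graph, two independently edge-sampled
    copies G1 and G2 (each edge of G kept with probability s in each copy),
    and a uniformly random relabeling pi of G2's nodes."""
    rng = random.Random(seed)
    G = {(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p}
    G1 = {e for e in G if rng.random() < s}            # first sampled copy
    G2 = {e for e in G if rng.random() < s}            # second sampled copy
    pi = list(range(n))
    rng.shuffle(pi)                                    # pi[v] = new label of node v
    G2 = {tuple(sorted((pi[i], pi[j]))) for (i, j) in G2}
    return G1, G2, pi
```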
Prior Results
Special Case: 𝑠 = 1
• In this case, the problem reduces to the graph isomorphism problem:
• find a permutation of the nodes of 𝐺2 that reproduces 𝐺1 exactly
• results below are for Erdős–Rényi graphs (each edge occurs with probability 𝑝)
• Result 1 (Erdős and Rényi, 1963): The identity permutation is the unique solution w.h.p. (i.e., Erdős–Rényi graphs have no nontrivial automorphisms w.h.p.)
• Result 2 (Babai, Erdős, and Selkow, 1980): There is a linear-time algorithm that finds the correct permutation (random graph isomorphism) with high probability.
General Case: Permutation Matrices
• Permuting the node labels of a graph changes its adjacency matrix
• If 𝐴 is the original adjacency matrix, the new adjacency matrix can be represented as 𝑃ᵀ𝐴𝑃, where 𝑃 is a permutation matrix (exactly one 1 in each row and in each column, and 0s elsewhere)
(Figure: adjacency matrix 𝐴 and its relabeled version 𝑃ᵀ𝐴𝑃.)
Mismatch Metric
• General case (𝑠 < 1): Permute the nodes of 𝐺2 so that the adjacency matrices of 𝐺1 and 𝐺2 are as similar as possible
• In other words, minimize the mismatch:
$$\min_{P \in \mathcal{P}} \left\| A_1 - P^T A_2 P \right\|_F^2$$
𝐴𝑖: Adjacency matrix for 𝐺𝑖; 𝒫: Set of permutation matrices
• The solution produces the correct permutation w.h.p. for many graph models (e.g., Cullina-Kiyavash, 2016). However, it is computationally infeasible
(Figure: matrices 𝐴1, 𝐴2, and 𝑃ᵀ𝐴2𝑃.)
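As a concrete reading of this objective, here is a small numpy sketch (illustrative, not from the talk) that evaluates the mismatch for a given permutation; `perm[i]` is assumed to be the node of 𝐺2 matched to node i of 𝐺1.

```python
import numpy as np

def mismatch(A1, A2, perm):
    """Frobenius mismatch ||A1 - P^T A2 P||_F^2, where P is the permutation
    matrix corresponding to perm (perm[i] = node of G2 matched to node i of G1).
    Relabeling A2 by fancy indexing is equivalent to forming P^T A2 P."""
    perm = np.asarray(perm)
    B = A2[np.ix_(perm, perm)]        # B[i, j] = A2[perm[i], perm[j]]
    return float(np.sum((A1 - B) ** 2))
```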
Convex Relaxation
• Change the problem to:
$$\min_{P \in D} \left\| A_1 - P^T A_2 P \right\|_F^2$$
• 𝐷 is the set of doubly stochastic matrices, which is the convex hull of the set of permutation matrices
• Algorithm: Solve this problem and project the result onto the set of permutation matrices 𝒫
• No guarantees for recovery
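One standard way to carry out the projection step is to solve a linear assignment problem on the entries of the doubly stochastic solution. A sketch using scipy follows; this particular rounding rule is an assumption, not necessarily the one used in the talk.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def round_to_permutation(D):
    """Round a doubly stochastic matrix D to a permutation matrix by
    maximum-weight bipartite matching (linear assignment) on its entries."""
    rows, cols = linear_sum_assignment(D, maximize=True)
    P = np.zeros_like(D)
    P[rows, cols] = 1.0
    return P
```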
Seed-Based Approaches (Narayanan-Shmatikov, 2009)
Many recent algorithms assume that some node identities are known in 𝐺2 . These
identities are used to discover all of the remaining node identities in the graph:
Result 1 (Yartseva and Grossglauser, 2013): For 𝐺(𝑛, 𝑝), if
$$\frac{s^2}{n^{3/4+\epsilon}} \;\gg\; \frac{p s^2}{\log n} \;\gg\; \frac{1}{n},$$
and at least
$$\frac{3}{4}\left(\frac{6}{n\,(p s^2)^4}\right)^{1/3}$$
initial “seed” identities are known, then the algorithm recovers the identities of 𝑛 − 𝑜(𝑛) nodes with high probability.
Seed-Based Approaches
Result 2 (Korula and Lattanzi, 2013):
• A similar algorithm to Result 1, but with slightly worse theoretical guarantees for Erdős–Rényi graphs
• The authors also show that the algorithm recovers 97% of the identities if 𝐺 is generated according to a certain preferential attachment model
Seed-Based Approaches
Result 3 (Kazemi, Hassani, and Grossglauser, 2015):
• A similar algorithm to Result 1, but it allows for erroneous initial seed identities
• The algorithm makes 𝑜(𝑛) errors
Our Model/Results
Our Model
• Stochastic block model for the underlying graph 𝐺
• Two communities 𝐶1 and 𝐶2
• Edge probability 𝑝 within the same community
• Edge probability 𝑞 across communities
• Assume 𝑝 > 𝑞
Our Model
• We start with two independently edge-sampled copies of a graph generated by a stochastic block model
• A fraction 𝛽 of the node labels of 𝐺2 are correct, but the rest are incorrect (call this permutation 𝜋). We don’t know which labels are correct
• Problem: Find the correct labels of the nodes of 𝐺2
(Figure: 𝐺 is edge-sampled twice to produce 𝐺1 and 𝐺2, and the node labels of 𝐺2 are then permuted.)
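For intuition, here is a minimal sketch of how such a partially correct permutation could be generated; the construction and names are illustrative, not code from the talk.

```python
import random

def partially_correct_permutation(n, beta, seed=0):
    """Return a permutation pi of {0, ..., n-1} in which a uniformly random set
    of about beta*n nodes keeps its correct label (fixed points of pi) and the
    remaining labels are permuted uniformly at random among themselves."""
    rng = random.Random(seed)
    correct = set(rng.sample(range(n), int(beta * n)))
    wrong = [v for v in range(n) if v not in correct]
    shuffled = wrong[:]
    rng.shuffle(shuffled)
    pi = list(range(n))
    for v, w in zip(wrong, shuffled):
        pi[v] = w                      # incorrect label for node v
    return pi
```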
Motivation
• Problem: we are given a partially correct permutation (𝜋) and have to make it fully correct
• Use convex relaxation (or some other method) to produce a preliminary estimate of the correspondence between the nodes of the two graphs (𝜋)
• In practice, some fraction of the nodes (𝛽) will be correctly matched and some will not, but we don’t know which nodes are correctly matched
• Key Assumption: The correct and incorrect matches are distributed uniformly at random
Main Result
Suppose nodes are placed into community 𝐶1 with probability 𝛼 and community 𝐶2 with probability 1 − 𝛼, with 𝛼 ≥ 1/2. Then, if
$$s^2 \beta \left(\alpha q + (1-\alpha) p\right) > \frac{16 \log n}{n}, \qquad p = o(1), \qquad \text{and} \qquad \beta (1-\alpha) > \frac{48 \log n}{n},$$
there exists an efficient algorithm which recovers the correct permutation exactly with high probability.
Main Result
Interpretation of the conditions:
• First condition: the node degree is large enough for connectivity
• 𝑝 = 𝑜(1): but not too large, so the graph is not densely connected
• Third condition: the fraction of correctly matched nodes is not too small
The Algorithm: Witnesses (Korula-Lattanzi)
• Question: Is 𝑢 in 𝐺1 the same as 𝑣 in 𝐺2?
• If 𝑎 is matched to 𝑎′, and 𝑎 is a neighbor of 𝑢 and 𝑎′ is a neighbor of 𝑣, then 𝑎-𝑎′ is a witness for 𝑢-𝑣
• The more witnesses we have, the more confident we are about a match
• We wish to match nodes in 𝐺1 with nodes in 𝐺2 in a way that maximizes the total number of witnesses
(Figure: 𝑢 in 𝐺1 and 𝑣 in 𝐺2 with the neighbor pair 𝑎-𝑎′ matched under 𝜋; here 𝑤(𝑢, 𝑣) = 2.)
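Written directly from this definition (the names N1, N2 for neighbor sets and pi for the current matching are illustrative):

```python
def witnesses(u, v, N1, N2, pi):
    """Number of witnesses for the candidate match u (in G1) <-> v (in G2).
    N1[u]: neighbors of u in G1; N2[v]: neighbors of v in G2;
    pi[a]: node of G2 currently matched to node a of G1.
    a-pi[a] is a witness for u-v when a is a neighbor of u and pi[a] of v."""
    return sum(1 for a in N1[u] if pi[a] in N2[v])
```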
MWM on Bipartite Graphs
Construct a bipartite graph with the nodes of 𝐺1 on the left and those of 𝐺2 on the right:
1. For every 𝑢 ∈ 𝑉1, 𝑣 ∈ 𝑉2, calculate the number of witnesses 𝑤(𝑢, 𝑣). This is the weight of the edge (𝑢, 𝑣)
2. Find the maximum weighted matching
Complexity of a naïve implementation: 𝑂(𝑛³). Can we reduce it?
(Figure: weighted bipartite graph between 𝑉1 = {𝑎, 𝑏, 𝑐, …} and 𝑉2 = {𝑥, 𝑦, 𝑧, …}, with edge weights such as 𝑤(𝑎, 𝑥) and 𝑤(𝑐, 𝑧).)
Step 1: Efficiently calculating 𝑤(𝑢, 𝑣)
In parallel, for each 𝑢, we can efficiently calculate the number of witnesses 𝑤(𝑢, 𝑣) for every 𝑣 as follows: count the two-hop paths that go from 𝑢 to a neighbor 𝑎 in 𝐺1, then to the matched node 𝑎′ = 𝜋(𝑎), and then to a neighbor 𝑣 of 𝑎′ in 𝐺2; the number of such paths from 𝑢 to 𝑣 is exactly 𝑤(𝑢, 𝑣).
New complexity: 𝑂(|𝐸1| Δ2), where Δ𝑖 is the max degree of a node in graph 𝐺𝑖.
(Figure: paths from 𝑢 through matched pairs 𝑎-𝑎′, 𝑏-𝑏′, 𝑐-𝑐′ to nodes of 𝐺2; e.g., 𝑤(𝑢, 𝑣) = 3, 𝑤(𝑢, 𝑦) = 2, 𝑤(𝑢, 𝑧) = 1, 𝑤(𝑢, 𝑡) = 1.)
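A sketch of this per-node computation (same illustrative names as before); the two nested loops follow exactly the two-hop paths described above.

```python
from collections import Counter

def witness_counts_from(u, N1, N2, pi):
    """For a fixed u in G1, compute w(u, v) for every v in G2 in one pass:
    each two-hop path u -> a (neighbor in G1) -> pi[a] -> v (neighbor in G2)
    contributes one witness. Cost is O(deg(u) * Delta_2) per node u, and the
    loop over different u's is embarrassingly parallel."""
    w = Counter()
    for a in N1[u]:                 # neighbors of u in G1
        for v in N2[pi[a]]:         # neighbors of the matched node in G2
            w[v] += 1
    return w                        # w[v] = number of witnesses for (u, v)
```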
Step 2: Greedy Matching, instead of MWM
• Add the match with the largest number of witnesses, remove conflicts, add the match with the next largest number of witnesses, … (a sketch of this step follows below)
• Why does it work?
• (Figure: the weights of the bipartite graph shown as a matrix, with the true 𝜋 as the identity; dark blue: low values, dark red: high values)
• The diagonal entries tend to dominate their respective rows and columns.
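A minimal sketch of the greedy step, assuming the witness counts have been collected into a dictionary of pair weights; tie-breaking is arbitrary.

```python
def greedy_match(weights):
    """Greedy matching on the bipartite witness graph: repeatedly add the pair
    (u, v) with the largest weight whose endpoints are both still unmatched.
    weights: dict mapping (u, v) -> number of witnesses w(u, v)."""
    match, used_v = {}, set()
    for (u, v), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        if u not in match and v not in used_v:
            match[u] = v
            used_v.add(v)
    return match
```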
Why Does Greedy Matching Work?
• E-R graph with 𝑝 being the probability of an edge (Yartseva-Grossglauser, Korula-Lattanzi). Each edge appears in 𝐺1, 𝐺2 with probability 𝑠
• Probability that 𝑥 is a witness for (𝑢, 𝑣) if it is a correct match: 𝑝𝑠²
• Probability that 𝑥 is a witness for (𝑢, 𝑣) if it is an incorrect match: 𝑝²𝑠²
• If 𝑝 is small, 𝑝𝑠² ≫ 𝑝²𝑠², so the number of witnesses for (𝑢, 𝑣) is much larger on average for a correct match
Simulations
• In practice, the algorithm can be run repeatedly
• Suppose 10% of the matches are correct initially; by running the algorithm once, one may increase this to something larger than 10%
• Run it again to increase the number of correct matches further
• Repeat several times…
• Threshold phenomenon: if the initial number of correct matches is too small, iterating doesn’t help; otherwise, we can match “all” nodes correctly
E-R Graphs
(Plots: performance when running the algorithm once and when running it iteratively, as a function of the fraction of initially correct matches.)
Performance on Various Graph Models
(Plots: performance on the stochastic block model and the Barabási-Albert model, as a function of the fraction of initially correct matches.)
Possible Algorithm for Seedless Matching
(𝐺1, 𝐺2) → Seedless Algorithm (e.g., Convex Relaxation Approach) → initial estimate 𝜋 → Witness-Based Correction Technique → corrected 𝜋
Seedless Matching In Practice
Our algorithm on Barabási-Albert graphs for varying values of 𝑠
Conclusions
• Can we recover all node identities w.h.p. with no seeds?
• No low-complexity algorithm is known for any random graph model for 𝑠 < 1
• For a stochastic block model, if we are given an initial estimate of the correct permutation, then exact matches can be found w.h.p.
• The initial estimate must contain a fraction of nodes that are correctly matched, but we don’t need to know which matches are correct
• Motivation: convex relaxation or other approaches can provide an initial estimate
• Assumption: the correct and incorrect matches are uniformly distributed
• This assumption can be removed, but with much weaker theoretical guarantees