Approximate Graph Matching
R. Srikant, ECE/CSL, UIUC
Coauthor: Joseph Lubars

Problem Statement
• Given two correlated graphs: one with known node identities, one with unknown (or incorrect) node identities
• Goal: Infer the identities of the nodes in the second graph

Computational Complexity Requirement
• We are interested in very large graphs, so algorithms must have low computational complexity

Problem Goes by Many Names
• Approximate Graph Matching
• Random Graph Isomorphism: special case
• Network Deanonymization: privacy
• Network Alignment: biology
• …

Application 1: Social Networks
• Sample edges from an underlying friendship graph to obtain social networks
• Use the graph topology of one social network to deanonymize members of another network
(Figure: a friendship graph on Alice, Bob, and Carol, sampled into two social networks)

Application 2: Protein Interaction
• Find proteins with similar functions across different species based on the topologies of their interaction networks
(Figure: human and mouse protein interaction networks, e.g., P62805 matched to P62806)

Application 3: Wikipedia Articles
• Automatically find or correct corresponding articles in different versions of Wikipedia based on the graph of article links
(Figure: English/French article pairs such as Earth/Terre, Sun/Soleil, Solar System/Système solaire, Hydrosphere/Hydrosphère, Supercontinent/Supercontinent)

Mathematical Model
• Start with a random graph G (generated according to some random graph model)
• Independently sample each edge of G twice with probability s to obtain G1 and G2
• Permute the node labels of one graph (call the permutation π)
• Note: permuting the node identities, giving them different identities, or erasing the node identities are all equivalent

Prior Results, Special Case: s = 1
• In this case, the problem reduces to the graph isomorphism problem: find a permutation of the nodes of G2 that produces G1 exactly, which succeeds w.h.p.
• The results below are for Erdős–Rényi graphs (each edge occurs with probability p)
• Result 1 (Erdős and Rényi, 1963): The identity permutation is the unique solution w.h.p. (i.e., Erdős–Rényi graphs have no nontrivial automorphisms w.h.p.)
• Result 2 (Babai, Erdős, and Selkow, 1980): There is a linear-time algorithm that finds the correct permutation (random graph isomorphism) with high probability

General Case: Permutation Matrices
• Permuting the node labels of a graph changes its adjacency matrix
• If A is the original adjacency matrix, the new adjacency matrix can be represented as Pᵀ A P, where P is a permutation matrix (exactly one 1 in each row and each column, and 0s elsewhere)

Mismatch Metric
• General case (s < 1): Permute the nodes of G2 so that the adjacency matrices of G1 and G2 are as similar as possible
• In other words, minimize the mismatch:
      min_{P ∈ 𝒫} ‖A1 − Pᵀ A2 P‖²_F
  where Ai is the adjacency matrix of Gi and 𝒫 is the set of permutation matrices
• The solution produces the correct permutation w.h.p. for many graph models (e.g., Cullina–Kiyavash, 2016)
• However, this minimization is computationally infeasible
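As a concrete illustration of the mismatch objective, here is a minimal numpy sketch; the function name and the convention that perm[i] is the G2 node matched to node i of G1 are our own illustrative choices, not from the talk.

```python
import numpy as np

def mismatch(A1: np.ndarray, A2: np.ndarray, perm: np.ndarray) -> float:
    """Squared Frobenius mismatch ||A1 - P^T A2 P||_F^2, where P is the
    permutation matrix mapping node i of G1 to node perm[i] of G2."""
    # (P^T A2 P)[i, j] = A2[perm[i], perm[j]], so re-index A2 directly.
    A2_perm = A2[np.ix_(perm, perm)]
    return float(np.sum((A1 - A2_perm) ** 2))
```

For two identical graphs and the correct permutation the mismatch is 0; minimizing it exactly would require searching over all n! permutations, which motivates the relaxation below.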
Convex Relaxation
• Change the problem to:
      min_{P ∈ D} ‖A1 − Pᵀ A2 P‖²_F
  where D is the set of doubly stochastic matrices, the convex hull of the set of permutation matrices
• Algorithm: Solve this relaxed problem and project the result onto the set of permutation matrices 𝒫
• No guarantees for recovery

Seed-Based Approaches (Narayanan–Shmatikov, 2009)
Many recent algorithms assume that some node identities are known in G2. These identities are used to discover all of the remaining node identities in the graph:
• Result 1 (Yartseva and Grossglauser, 2013): For G(n, p), if n^(−3/4 + ε) ≫ p s² ≫ (log n)/n and at least (6 / (n (p s²)⁴))^(1/3) initial "seed" identities are known, then the algorithm recovers the identities of n − o(n) nodes with high probability
• Result 2 (Korula and Lattanzi, 2013): A similar algorithm to Result 1, but with slightly worse theoretical guarantees for Erdős–Rényi graphs. The authors also show that the algorithm recovers 97% of the identities if G is generated according to a certain preferential attachment model
• Result 3 (Kazemi, Hassani, and Grossglauser, 2015): A similar algorithm to Result 1, but allows for erroneous initial seed identities; makes o(n) errors

Our Model and Results

Our Model
• Stochastic block model with two communities C1 and C2
• Edge probability p within the same community, edge probability q across communities; assume p > q
• We start with two independently edge-sampled copies G1 and G2 of a graph G generated by the stochastic block model, and the node labels of G2 are permuted
• A fraction β of the node labels of G2 are correct, but the rest are incorrect (call this labeling π). We don't know which labels are correct
• Problem: Find the correct labels of the nodes of G2

Motivation
• Problem: we are given a partially correct permutation π, and we have to make it fully correct
• Use convex relaxation (or some other method) to produce a preliminary estimate π of the correspondence between the nodes of the two graphs
• In practice, some fraction β of the nodes will be correctly matched and some will not, but we don't know which nodes are correctly matched
• Key Assumption: The correct and incorrect matches are distributed uniformly at random

Main Result
Suppose nodes are placed into community C1 with probability α and community C2 with probability 1 − α, with α ≥ 1/2. Then, if
      s² β (αq + (1 − α)p) > 16 (log n)/n,    p = o(1),    and    β(1 − α) > 48 (log n)/n,
then there exists an efficient algorithm which recovers the correct permutation exactly with high probability.

Interpreting the conditions:
• The node degree is large enough for connectivity
• But not so large that the graph is densely connected
• The fraction of correctly matched nodes is not too small
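To make the model concrete, here is a small numpy sketch that generates the two correlated observations and a partially correct initial matching. The function names, and the simplifying choice of keeping the true correspondence as the identity so that the matching (not the graph) carries the errors, are our own assumptions for illustration.

```python
import numpy as np

def sym_bernoulli(n, prob, rng):
    """Symmetric 0/1 matrix: each unordered pair (i, j), i < j, is 1 with probability prob."""
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    return (upper | upper.T).astype(int)

def correlated_sbm_pair(n, alpha, p, q, s, beta, rng):
    """Two edge-sampled copies of one two-community SBM graph, plus an initial
    matching pi0 that is correct on roughly a beta fraction of the nodes."""
    # Community labels: community 1 with probability alpha, community 2 otherwise.
    comm = (rng.random(n) >= alpha).astype(int)
    # Parent graph G: edge probability p within a community, q across communities.
    probs = np.where(comm[:, None] == comm[None, :], p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(int)
    # Independently sample each edge of G twice with retention probability s.
    A1 = A * sym_bernoulli(n, s, rng)
    A2 = A * sym_bernoulli(n, s, rng)
    # Initial matching: keep each node's label with probability beta,
    # scramble the remaining labels uniformly (we don't know which are correct).
    pi0 = np.arange(n)
    wrong = np.flatnonzero(rng.random(n) >= beta)
    pi0[wrong] = rng.permutation(pi0[wrong])
    return A1, A2, pi0

# Example usage with illustrative parameter values:
# rng = np.random.default_rng(0)
# A1, A2, pi0 = correlated_sbm_pair(n=2000, alpha=0.5, p=0.01, q=0.002, s=0.8, beta=0.2, rng=rng)
```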
The Algorithm: Witnesses (Korula–Lattanzi)
• Question: Is u in G1 the same as v in G2?
• If a is matched to a′, a is a neighbor of u, and a′ is a neighbor of v, then a–a′ is a witness for u–v
• The more witnesses we have, the more confident we are about a match
• We wish to match nodes in G1 with nodes in G2 in a way that maximizes the total number of witnesses
(Figure: u in G1 and v in G2 with two matched neighbor pairs, so w(u, v) = 2)

Maximum Weighted Matching (MWM) on Bipartite Graphs
Build a bipartite graph with the nodes of G1 on the left and those of G2 on the right:
1. For every u ∈ V1, v ∈ V2, calculate the number of witnesses w(u, v); this is the weight of the edge (u, v)
2. Find the maximum weighted matching
Complexity of a naïve implementation: O(n³). Can we reduce it?

Step 1: Efficiently Calculating w(u, v)
• In parallel, for each u, we can efficiently calculate the number of witnesses w(u, v) for every v: the number of paths u–a–a′–v, where a–a′ is a matched pair, gives w(u, v)
• New complexity: O(|E1| Δ2), where Δi is the maximum degree of a node in graph Gi
(Figure: paths from u through matched pairs give w(u, v) = 3, w(u, y) = 2, w(u, t) = w(u, z) = 1)

Step 2: Greedy Matching Instead of MWM
• Add the match with the largest number of witnesses, remove conflicts, add the match with the next largest number of witnesses, and so on
• Why does it work? Viewing the weights of the bipartite graph as a matrix, with the true π as the identity, the diagonal entries tend to dominate their respective rows and columns
(Figure: heat map of the weight matrix; dark blue: low values, dark red: high values)
(A code sketch of the witness-counting and greedy-matching step is given below, after the simulation results.)

Why Does Greedy Matching Work?
• Consider an Erdős–Rényi graph with edge probability p (Yartseva–Grossglauser, Korula–Lattanzi); each edge appears in G1 and G2 with probability s
• Probability that x is a witness for (u, v) if it is a correct match: p s²
• Probability that x is a witness for (u, v) if it is an incorrect match: p² s²
• If p is small, p s² ≫ p² s², so the number of witnesses for (u, v) is much larger on average for a correct match

Simulations
• In practice, the algorithm can be run repeatedly: if 10% of the matches are correct initially, running the algorithm once may increase this to something larger than 10%
• Run it again to increase the number of correct matches, and repeat several times
• Threshold phenomenon: if the initial number of correct matches is small, iterating doesn't help; otherwise, "all" nodes can be matched correctly

Erdős–Rényi Graphs
(Figures: fraction of correct matches vs. fraction of initially correct matches, running the algorithm once and running it iteratively)

Performance on Various Graph Models
(Figures: fraction of correct matches vs. fraction of initially correct matches for the stochastic block model and the Barabási–Albert model)

Possible Algorithm for Seedless Matching
• G1, G2 → seedless algorithm (e.g., the convex relaxation approach) → initial estimate π → witness-based correction technique → corrected π

Seedless Matching in Practice
(Figure: our algorithm on Barabási–Albert graphs for varying values of s)
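The following is a minimal dense-matrix sketch of one round of the witness-counting and greedy-matching step described above. It is our own illustrative implementation of the idea, not the authors' code; the efficient O(|E1| Δ2) version would work on adjacency lists rather than forming the full n×n weight matrix.

```python
import numpy as np

def witness_greedy_round(A1, A2, pi0):
    """One round of witness-based correction: count witnesses w(u, v) through
    the current matching pi0, then rematch greedily by descending witness count."""
    n = A1.shape[0]
    # M[a, a'] = 1 if the current matching maps node a of G1 to node a' of G2.
    M = np.zeros((n, n), dtype=int)
    M[np.arange(n), pi0] = 1
    # w(u, v) = #{(a, a') : a ~ u in G1, pi0(a) = a', a' ~ v in G2},
    # i.e., the (u, v) entry of A1 @ M @ A2.
    W = A1 @ M @ A2
    # Greedy matching: take (u, v) pairs in order of decreasing weight,
    # skipping pairs that conflict with matches already made.
    order = np.argsort(W, axis=None)[::-1]
    pi = np.full(n, -1, dtype=int)
    u_free = np.ones(n, dtype=bool)
    v_free = np.ones(n, dtype=bool)
    for flat in order:
        u, v = divmod(int(flat), n)
        if u_free[u] and v_free[v]:
            pi[u] = v
            u_free[u] = v_free[v] = False
    return pi
```

In line with the iterative use described in the simulations, the round can be repeated, for example: pi = pi0, then pi = witness_greedy_round(A1, A2, pi) several times; this repeated application is what produces the threshold behavior shown in the plots.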
Conclusions
• Can we recover all node identities w.h.p. with no seeds? No low-complexity algorithm is known for any random graph model for s < 1
• For a stochastic block model, if we are given an initial estimate of the correct permutation, then exact matches can be found w.h.p.
• The initial estimate must contain a fraction of nodes that are correctly matched, but we don't need to know which matches are correct
• Motivation: convex relaxation or other approaches can provide an initial estimate
• Assumption: the correct and incorrect matches are uniformly distributed
• This assumption can be removed, but with much weaker theoretical guarantees