Graph similarity
P. Van Dooren, CESAME, Université catholique de Louvain
based on joint work with Blondel, Gajardo, Heymans, Senellart
The web graph
Nodes = web pages, Edges = hyperlinks between pages
Several billion web pages
Average of 7 outgoing links per page
Structure of the web
Experiments : two crawls over 200 million pages in 1999
found a giant strongly connected component (core)
- Contains most prominent sites
- It contains 30% of all pages
- Average distance between nodes is 16
- Small world
In- and out-degree distributions
Power law distribution : number of pages of in-degree n is
proportional to 1/n^2.1 (Zipf law)
A score for every page
The score of a page is high if the page has many incoming
links coming from pages with high page score
One browses from page to page by following outgoing links
with equal probability. Score = frequency with which a page is visited.
… some pages may have no outgoing links
… many pages have zero frequency
PageRank : teleporting random surfer score
The surfer follows a path by choosing each outgoing link with probability
p/d_out(i), or teleports to a random web page with probability 1-p (0 < p < 1).
Put the transition probability from i to j in a matrix M (b_ij = 1 if i→j) :
m_ij = p b_ij / d_out(i) + (1-p)/n
Then the vector x of the probability distribution on the nodes of the graph
is the steady-state vector of the iteration x_{k+1} = M^T x_k, i.e. the dominant
left eigenvector of the matrix M (unique because of Perron-Frobenius)
PageRank of node i is the (relative) size of element i of this vector
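As an illustration (not from the original slides), a minimal numpy sketch of this iteration; the function name `pagerank` and the 3-page example are ours, and with the row-stochastic M above the steady state solves x = M^T x :

```python
import numpy as np

def pagerank(B, p=0.85, iters=1000, tol=1e-12):
    """Power method for the teleporting random surfer.
    B : 0/1 adjacency matrix, B[i, j] = 1 if page i links to page j.
    p : probability of following a link; 1-p is the teleport probability."""
    n = B.shape[0]
    d_out = B.sum(axis=1)
    M = np.full((n, n), (1 - p) / n)      # teleport part of m_ij
    for i in range(n):
        if d_out[i] > 0:
            M[i] += p * B[i] / d_out[i]   # m_ij = p b_ij / d_out(i) + (1-p)/n
        else:
            M[i] = 1.0 / n                # dangling page: always teleport
    x = np.full(n, 1.0 / n)
    for _ in range(iters):
        x_new = M.T @ x                   # x_{k+1} = M^T x_k (row-stochastic M)
        if np.linalg.norm(x_new - x, 1) < tol:
            break
        x = x_new
    return x_new

# tiny example: pages 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
B = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]])
print(pagerank(B))                        # PageRank scores, summing to 1
```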
Kleinberg’s structure graph
The score of a page is high if the page has
many incoming links
The score is high if the incoming links are
from pages that have high scores
This inspired Kleinberg’s “structure graph”
[diagram : hub → authority]
Good authorities for “University Belgium”
A good hub for “University Belgium”
Hub and authority scores
The hub score h_i and authority score a_i ought to be mutually reinforcing :
pages with large h_i point to pages with high a_j
pages with large a_i are pointed to by pages with high h_j
h_i ← Σ_{j : i→j} a_j
a_i ← Σ_{j : j→i} h_j
or, using the adjacency matrix B of the graph (b_ij = 1 if i→j is an edge)
[h ; a]_{k+1} = [ 0 B ; B^T 0 ] [h ; a]_k ,   [h ; a]_0 = [1 ; 1]
Use the limiting vector a (dominant eigenvector of B^T B) to rank pages
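A short numpy sketch of this mutually reinforcing iteration (illustrative; the name `hits` is ours) :

```python
import numpy as np

def hits(B, iters=100):
    """Simultaneous update [h; a] <- [0 B; B^T 0][h; a], normalized each step.
    B : adjacency matrix with B[i, j] = 1 if i -> j."""
    h = np.ones(B.shape[0])
    a = np.ones(B.shape[0])
    for _ in range(iters):
        h, a = B @ a, B.T @ h   # h_i = sum of a_j over i->j ; a_i = sum of h_j over j->i
        h /= np.linalg.norm(h)
        a /= np.linalg.norm(a)
    return h, a                 # a is (up to sign) the dominant eigenvector of B^T B

# path graph 1 -> 2 : node 1 is a pure hub, node 2 a pure authority
B = np.array([[0, 1],
              [0, 0]])
print(hits(B))
```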
Alternative method : structure graph
Give three scores to each web page : begin b, center c, end e
[diagram : b → c → e]
Use again mutual reinforcement to define the iteration
b_j ← Σ_{i : j→i} c_i
c_j ← Σ_{i : i→j} b_i + Σ_{i : j→i} e_i
e_j ← Σ_{i : i→j} c_i
Defines a limiting vector for the iteration
x_{k+1} = M x_k ,  x_0 = 1
where x = [b ; c ; e] and M = [ 0 B 0 ; B^T 0 B ; 0 B^T 0 ]
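The same iteration in code, as a sketch (illustrative; the name `structure_scores` is ours) :

```python
import numpy as np

def structure_scores(B, iters=100):
    """Begin/center/end scores from x_{k+1} = M x_k, x_0 = 1, with
    M = [[0, B, 0], [B^T, 0, B], [0, B^T, 0]].  An even number of
    normalized steps is taken, since the iterates may oscillate."""
    n = B.shape[0]
    Z = np.zeros((n, n))
    M = np.block([[Z,   B,   Z],
                  [B.T, Z,   B],
                  [Z,   B.T, Z]])
    x = np.ones(3 * n)
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)
    return x[:n], x[n:2*n], x[2*n:]   # b, c, e

# path graph 1 -> 2 -> 3 : node 1 scores as begin, node 2 as center, node 3 as end
B = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])
print(structure_scores(B))
```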
Bow tie example
graph A : 1 → 2
graph B : bow-tie graph with a central node, n incoming edges from nodes 2, …, n+1
and m outgoing edges to nodes n+2, …, n+m+1
(rows of S below : central node, then nodes 2, …, n+1, then nodes n+2, …, n+m+1)
if m > n :  S = [ ρ 0 ; 0 0 ; … ; 0 0 ; 0 1 ; … ; 0 1 ]
(central node similar to node 1, end nodes similar to node 2)
if n > m :  S = [ 0 ρ ; 1 0 ; … ; 1 0 ; 0 0 ; … ; 0 0 ]
(begin nodes similar to node 1, central node similar to node 2)
not satisfactory : the scores change abruptly when m and n cross
Bow tie example
graph A : 1 → 2 → 3
graph B : the same bow-tie graph
S = [ 0 ρ 0 ; 1 0 0 ; … ; 1 0 0 ; 0 0 1 ; … ; 0 0 1 ]
(begin nodes similar to node 1, central node to node 2, end nodes to node 3)
the central score is good : it no longer depends on m and n
Towards arbitrary graphs
For the graph • → •
A = [ 0 1 ; 0 0 ]  and  M = [ 0 B ; B^T 0 ]
For the graph • → • → •
A = [ 0 1 0 ; 0 0 1 ; 0 0 0 ]  and  M = [ 0 B 0 ; B^T 0 B ; 0 B^T 0 ]
Formula for M for two arbitrary graphs G_A and G_B :
M = B ⊗ A + B^T ⊗ A^T
With x_k = vec(X_k), the iteration x_{k+1} = M x_k is equivalent to X_{k+1} = A X_k B^T + A^T X_k B
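This vec/Kronecker equivalence is easy to check numerically; a small sketch with random 0/1 matrices and the column-stacking vec :

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 2, (3, 3)).astype(float)   # two arbitrary adjacency matrices
B = rng.integers(0, 2, (4, 4)).astype(float)
X = rng.random((3, 4))                         # X has size n_A x n_B

vec = lambda Y: Y.flatten(order='F')           # column-stacking vec
M = np.kron(B, A) + np.kron(B.T, A.T)
print(np.allclose(M @ vec(X), vec(A @ X @ B.T + A.T @ X @ B)))   # True
```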
Similarity matrix of two arbitrary graphs
For A and B adjacency matrices of the two graphs S solves
ρ S = A S B^T + A^T S B
This matrix can be obtained via fixed point of power method (linear)
Two nodes are similar if their parents and children are similar
Such a recursive definition leads to an eigenvector equation
Ref: Blondel et al, SIAM Rev., ‘04
Algorithm
The (normalized) sequence
X_{k+1} = (A X_k B^T + A^T X_k B) / ||A X_k B^T + A^T X_k B||_F
has two limit points X_even and X_odd for every X_0 > 0
Similarity matrix S = lim_{k→∞} X_{2k} ,  X_0 = 1
S_ij is the similarity score between vertex V_i of G_A and vertex V_j of G_B
With x_k = vec(X_k), this is the power method on M = B ⊗ A + B^T ⊗ A^T :
x_{k+1} = M x_k / ||M x_k||_2
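A compact numpy sketch of this algorithm (illustrative; the name `similarity_matrix` is ours). The demo reproduces the cycle-graph self-similarity shown on a later slide, where every node is equally similar to every other :

```python
import numpy as np

def similarity_matrix(A, B, iters=100):
    """Even limit of X_{k+1} = (A X_k B^T + A^T X_k B)/||.||_F, X_0 = 1.
    Returns S with S[i, j] = similarity of node i of G_A to node j of G_B.
    iters must be even, since even and odd iterates can differ."""
    assert iters % 2 == 0
    X = np.ones((A.shape[0], B.shape[0]))
    for _ in range(iters):
        X = A @ X @ B.T + A.T @ X @ B
        X /= np.linalg.norm(X)        # Frobenius norm
    return X

# self-similarity of the cycle 1 -> 2 -> 3 -> 1 : S = ones/3,
# i.e. every node is equally similar to every other
C = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])
print(similarity_matrix(C, C))
```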
Some properties
Satisfies
ρ S = A S B^T + A^T S B ,  ρ = ||A S B^T + A^T S B||_F
It is the nonnegative fixed point S of largest 1-norm
It solves the optimization problem
max ⟨A S B^T + A^T S B , S⟩  subject to  ||S||_F = 1
Extension of Kleinberg's HITS method
Linear convergence (power method for sparse M)
Other properties
• Central score is a dominant eigenvector of B B^T + B^T B
(cf. hub score of B B^T and authority score of B^T B)
• Similarity matrix of a graph with itself is square and positive semi-definite
Path graph • → • → • :  S = [ .4 0 0 ; 0 .8 0 ; 0 0 .4 ]
Cycle graph :  S ∝ [ 1 1 1 ; 1 1 1 ; 1 1 1 ]
(every node of the cycle is equally similar to every other)
The dictionary graph
OPTED, based on Webster’s unabridged dictionary
http://msowww.anu.edu.au/~ralph/OPTED
Nodes = words present in the dictionary : 112,169 nodes
Edge (u,v) if v appears in the definition of u : 1,398,424 edges
Average of 12 edges per node
In- and out-degree distributions
Very similar to web (power law)
Words with highest in-degree :
of, a, the, or, to, in …
Words with zero out-degree :
14159, Fe3O4, Aaron,
and some undefined or misspelled words
Neighborhood graph
is the subset of vertices used for finding synonyms :
it contains “all” parents and children of the node
[figure : neighborhood graph of “likely”]
“Central” uses this sub-graph to rank synonyms automatically
Rank each node in the graph by its similarity to the central node c of the
structure graph b → c → e
Ref: Blondel et al, SIAM Rev., ‘04
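A sketch of how this ranking could look in code; the 4-word neighborhood and its definition links are hypothetical, for illustration only :

```python
import numpy as np

def central_scores(B, iters=100):
    """Similarity of each node of the neighborhood graph (adjacency B)
    to the central node c of the structure graph b -> c -> e."""
    A = np.array([[0, 1, 0],          # structure graph: path 1 -> 2 -> 3
                  [0, 0, 1],
                  [0, 0, 0]])
    X = np.ones((3, B.shape[0]))
    for _ in range(iters):            # even number of normalized steps
        X = A @ X @ B.T + A.T @ X @ B
        X /= np.linalg.norm(X)
    return X[1]                       # row of similarities to the center c

# hypothetical neighborhood of "likely" with made-up definition links
words = ["likely", "probable", "credible", "verisimilar"]
B = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]])
for word, score in sorted(zip(words, central_scores(B)), key=lambda t: -t[1]):
    print(f"{word}: {score:.3f}")
```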
Disappear

         Vectors         Central         ArcRank         WordNet         Microsoft
 1       vanish          vanish          epidemic        vanish          vanish
 2       wear            pass            disappearing    go away         cease to exist
 3       die             die             port            end             fade away
 4       sail            wear            dissipate       finish          die out
 5       faint           faint           cease           terminate       go
 6       light           fade            eat             cease           evaporate
 7       port            sail            gradually       wane
 8       absorb          light           instrumental    expire
 9       appear          dissipate       darkness        withdraw
 10      cease           cease           efface          pass away
 Mark    3.6             6.3             1.2             7.5             8.6
 Std Dev 1.8             1.7             1.2             1.4             1.3
Vectors, Central and ArcRank are automatic ; WordNet and Microsoft Word are manual
Science

         Vectors         Central         ArcRank         WordNet            Microsoft
 1       art             art             formulate       knowledge domain   discipline
 2       branch          branch          arithmetic      knowledge base     knowledge
 3       nature          law             systematize     discipline         skill
 4       law             study           scientific      subject            art
 5       knowledge       practice        knowledge       subject area
 6       principle       natural         geometry        subject field
 7       life            knowledge       philosophical   field
 8       natural         learning        learning        field of study
 9       electricity     theory          expertness      ability
 10      biology         principle       mathematics     power
 Mark    3.6             4.4             3.2             7.1                6.5
 Std Dev 2.0             2.5             2.9             2.6                2.4
Parallelogram

         Vectors         Central         ArcRank         WordNet         Microsoft
 1       square          square          quadrilateral   quadrilateral   diamond
 2       parallel        rhomb           gnomon          quadrangle      lozenge
 3       rhomb           parallel        right-lined     tetragon        rhomb
 4       prism           figure          rectangle
 5       figure          prism           consequently
 6       equal           equal           parallelopiped
 7       quadrilateral   opposite        parallel
 8       opposite        angles          cylinder
 9       altitude        quadrilateral   popular
 10      parallelopiped  rectangle       prism
 Mark    4.6             4.8             3.3             6.3             5.3
 Std Dev 2.7             2.5             2.2             2.5             2.6
Sugar

         Vectors         Central         ArcRank         WordNet            Microsoft
 1       juice           cane            granulation     sweetening         darling
 2       starch          starch          shrub           sweetener          baby
 3       cane            sucrose         sucrose         carbohydrate       honey
 4       milk            milk            preserve        saccharide         dear
 5       molasses        sweet           honeyed         organic compound   love
 6       sucrose         dextrose        property        saccarify          dearest
 7       wax             molasses        sorghum         sweeten            beloved
 8       root            juice           grocer          dulcify            precious
 9       crystalline     glucose         acetate         edulcorate         pet
 10      confection      lactose         saccharine      dulcorate          babe
 Mark    3.9             6.3             4.3             6.2                4.7
 Std Dev 2.0             2.4             2.3             2.9                2.7
Real-world application
Typed graphs
Graphs with colored nodes (Fraikin, Van Dooren, ECC ‘07)
Graphs with colored edges
Neighborhood graph (recap)
is the subset of vertices used for finding synonyms :
it contains all parents and children of the node (here with nodes of other types)
[figure : neighborhood graph of “likely”]
“Central” uses this sub-graph to rank synonyms automatically
Compares well with Vectors, ArcRank (automatic) and WordNet, Microsoft Word (manual)
Typed nodes
Partition the adjacency matrices
A = [ A11 A12 ; A21 A22 ]  and  B = [ B11 B12 ; B21 B22 ]
and compute (for the symmetric case) the Perron vector [ vec(S11) ; vec(S22) ] of
[ A11 ⊗ B11 , A12 ⊗ B12 ; A21 ⊗ B21 , A22 ⊗ B22 ]
which yields the block-diagonal similarity matrix S = [ S11 0 ; 0 S22 ]
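A sketch of the typed-nodes computation (illustrative; assumes the symmetric case of the slide and a column-stacking vec) :

```python
import numpy as np

def typed_similarity(A_blocks, B_blocks, iters=200):
    """Perron vector of [[A11 (x) B11, A12 (x) B12],
                         [A21 (x) B21, A22 (x) B22]],
    unstacked into the diagonal similarity blocks S11, S22."""
    (A11, A12), (A21, A22) = A_blocks
    (B11, B12), (B21, B22) = B_blocks
    K = np.block([[np.kron(A11, B11), np.kron(A12, B12)],
                  [np.kron(A21, B21), np.kron(A22, B22)]])
    x = np.ones(K.shape[0])
    for _ in range(iters):            # power method, even number of steps
        x = K @ x
        x /= np.linalg.norm(x)
    n1 = A11.shape[0] * B11.shape[0]
    # vec is column-stacking, so the blocks come out n_B x n_A; transpose them
    S11 = x[:n1].reshape((B11.shape[0], A11.shape[0]), order='F').T
    S22 = x[n1:].reshape((B22.shape[0], A22.shape[0]), order='F').T
    return S11, S22

# tiny demo: each graph has one node of each type and a single cross-type edge
A_blocks = ((np.array([[0.]]), np.array([[1.]])),
            (np.array([[1.]]), np.array([[0.]])))
print(typed_similarity(A_blocks, A_blocks))
```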
Typed edges
Partition the source and terminal matrices AS = [ AS1 AS2 ] , AT = [ AT1 AT2 ]
and BS = [ BS1 BS2 ] , BT = [ BT1 BT2 ]
and compute the left and right singular Perron vectors vec(N) and [ vec(E1) ; vec(E2) ]
of G = [ AS1 ⊗ BS1 + AT1 ⊗ BT1 , AS2 ⊗ BS2 + AT2 ⊗ BT2 ]
Concluding remarks
Iteration is on large sparse graphs
Complexity of one iteration step is linear in the number of nodes in both graphs
We have methods with linear convergence
(power-like method and gradient-like method)
Extensions to colored nodes and edges