Detecting Data Leakage

Simrank++: Query Rewriting through Link Analysis of the Click Graph
Ioannis Antonellis
Hector Garcia-Molina
Chi-Chao Chang
Sponsored Search Model
Advertisers
Queries
Bids
Stanford Infolab
Auction Model
[Figure labels: query, Ads, Relevance, Bid amount]
Motivating Example
addicting games
No ads!
www.addictinggames.com
Motivating Example
free online games
Modified Sponsored Search Model
• Advertisers bid on queries
• For each query
– The search engine runs an auction
– The auction considers ad relevance and bid amount
– The top 5-10 ads are displayed alongside the regular search results
• Extra: advertisers are charged a default amount when their ads are displayed for queries they did not bid on
Outline
• Sponsored Search Model
• Motivating Example
• Query Rewriting using the click graph
• Simrank
• Evidence-based Simrank
• Weighted Simrank
• Experiments
Sponsored search system
[Figure: Sponsored Search System; input: query q; data stores: History, Ads, Bids; output: ads]
Query Rewriting
[Figure: the Front End receives query q and sends "q, rewrites for q" to the Back End, which returns ads; data stores shown: History, Ads, Bids]
Click Graph from sponsored search
[Figure: bipartite click graph. Queries: pc, camera, digital camera, tv, flower. Ads: hp.com, bestbuy.com, teleflora.com, orchids.com. Edges are labeled with click counts (10, 20, 30, 5, 7, 15, 16, 15).]
Similar Queries: (digital camera, camera), (pc, camera), (pc, digital camera), (tv, camera), (tv, digital camera), (pc, tv)
Simrank [JW 2003]
• Intuition:
– “Two queries are similar if they are connected to
similar ads”
– “Two ads are similar if they are connected to
similar queries”
• Iterative procedure: at each iteration similarity
propagates in the graph
Simrank [JW 2003]
• N(q): number of ads connected to q
• E(q): set of ads connected to q
• sim_k(q,q’): q-q’ similarity at the k-th iteration
• Initially sim(q,q) = 1, sim(q,q’) = 0, sim(a,a) = 1, sim(a,a’) = 0
$$s_k(q, q') = \frac{C}{N(q)\,N(q')} \sum_{i \in E(q)} \sum_{j \in E(q')} s_{k-1}(i, j)$$
$$s_k(a, a') = \frac{C}{N(a)\,N(a')} \sum_{i \in E(a)} \sum_{j \in E(a')} s_{k-1}(i, j)$$
• Time: O(n^4)
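The two update equations can be sketched directly in Python. This is a minimal, unweighted sketch rather than the authors' implementation; the function name `simrank` and the edge-list input format are assumptions made for illustration.

```python
from itertools import product


def simrank(edges, C=0.8, iterations=5):
    """Bipartite Simrank on (query, ad) edges, following the two update equations."""
    # Neighbor sets: E(q) for queries and E(a) for ads.
    q_nbrs, a_nbrs = {}, {}
    for q, a in edges:
        q_nbrs.setdefault(q, set()).add(a)
        a_nbrs.setdefault(a, set()).add(q)

    def identity(nodes):
        # Initially sim(x, x) = 1 and sim(x, y) = 0.
        return {(x, y): 1.0 if x == y else 0.0 for x, y in product(nodes, repeat=2)}

    def update(nbrs, other_sim):
        new = {}
        for x, y in product(nbrs, repeat=2):
            if x == y:
                new[(x, y)] = 1.0
                continue
            total = sum(other_sim[(i, j)] for i in nbrs[x] for j in nbrs[y])
            new[(x, y)] = C * total / (len(nbrs[x]) * len(nbrs[y]))
        return new

    q_sim, a_sim = identity(q_nbrs), identity(a_nbrs)
    for _ in range(iterations):
        # Both sides are computed from the previous iteration's scores.
        q_sim, a_sim = update(q_nbrs, a_sim), update(a_nbrs, q_sim)
    return q_sim, a_sim


# Hypothetical example: two queries that share the same two ads.
edges = [("camera", "hp.com"), ("camera", "bestbuy.com"),
         ("digital camera", "hp.com"), ("digital camera", "bestbuy.com")]
q_sim, _ = simrank(edges, C=0.8, iterations=2)
print(round(q_sim[("camera", "digital camera")], 4))  # 0.56
```

The nested sums over every node pair are what make the naive procedure expensive on large click graphs.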
Simrank
[Figure: sample click graph (queries: pc, camera, digital camera, tv, flower; ads: hp.com, bestbuy.com, teleflora.com, orchids.com; edges are clicks), shown next to the update equations for s_k(q, q') and s_k(a, a')]
Two random surfers model
Simrank in matrix notation
• Input: transition matrix P, decay factor C, number of iterations k
• Output: similarity matrix S
• For i = 1:k, do
– temp = C · P^T · S · P
– S = temp + I – Diag(diag(temp))
• end
• Worst case running time: O(n^3), see also next talk
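The matrix loop can be sketched with NumPy. Two things here are assumptions, not statements from the slides: the edge list of the sample click graph (edge weights are ignored), and the normalization P[x, y] = A[x, y] / degree(x). With these choices the sketch gives the pc/camera and camera/digital camera scores shown on the iteration slides for C = 0.8.

```python
import numpy as np

queries = ["pc", "camera", "digital camera", "tv", "flower"]
ads = ["hp.com", "bestbuy.com", "teleflora.com", "orchids.com"]
# Assumed edges of the sample click graph (unweighted).
edges = [("pc", "hp.com"), ("camera", "hp.com"), ("camera", "bestbuy.com"),
         ("digital camera", "hp.com"), ("digital camera", "bestbuy.com"),
         ("tv", "bestbuy.com"), ("flower", "teleflora.com"), ("flower", "orchids.com")]

nodes = queries + ads
index = {name: i for i, name in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)))
for q, a in edges:
    A[index[q], index[a]] = A[index[a], index[q]] = 1.0

# Assumption: each row of P sums to 1, i.e. P[x, y] = A[x, y] / degree(x).
P = A / A.sum(axis=1, keepdims=True)


def simrank_matrix(P, C=0.8, k=1):
    """Run k iterations of S = C P^T S P with the diagonal reset to 1."""
    n = P.shape[0]
    S = np.eye(n)
    for _ in range(k):
        temp = C * P.T @ S @ P
        S = temp + np.eye(n) - np.diag(np.diag(temp))
    return S


S1 = simrank_matrix(P, C=0.8, k=1)
print(round(S1[index["camera"], index["digital camera"]], 4))  # 0.1778
```

Each iteration costs two dense matrix products, which is where the cubic worst case comes from.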
Simrank
1st Iteration (C = 0.8)
                 pc      camera  digital camera  tv  flower
pc               1
camera           0.0889  1
digital camera   0.0889  0.1778  1
tv               0       0.0889  0.0889          1
flower           0       0       0               0   1
[Figure: sample click graph and the update equations for s_k(q, q') and s_k(a, a')]
Simrank
2nd Iteration (C = 0.8)
                 pc      camera  digital camera  tv  flower
pc               1
camera           0.1244  1
digital camera   0.1244  0.2489  1
tv               0.0356  0.1244  0.1244          1
flower           0       0       0               0   1
[Figure: sample click graph and the update equations for s_k(q, q') and s_k(a, a')]
Simrank
12th Iteration (C = 0.8)
                 pc      camera  digital camera  tv  flower
pc               1
camera           0.1650  1
digital camera   0.1650  0.33    1
tv               0.0761  0.1650  0.1650          1
flower           0       0       0               0   1
[Figure: sample click graph and the update equations for s_k(q, q') and s_k(a, a')]
Outline
• Sponsored Search Model
• Motivating Example
• Query Rewriting using the click graph
• Simrank
• Evidence-based Simrank
• Weighted Simrank
• Evaluation
Evidence-based Simrank
• Problem: Simrank scores in complete bipartite
graphs are counter-intuitive
• See Theorems in paper, here examples for
intuition
Evidence-based Simrank
[Figure: two click graphs. pc and camera share one ad (hp.com); camera and digital camera share two ads (hp.com, bestbuy.com).]
C = 0.8
           (camera, digital camera)        (pc, camera)
iteration  Simrank    evidence-based       Simrank  evidence-based
1          0.4        0.3                  0.8      0.4
2          0.56       0.42                 0.8      0.4
3          0.624      0.468                0.8      0.4
4          0.6496     0.4872               0.8      0.4
5          0.65984    0.49488              0.8      0.4
6          0.663933   0.497952             0.8      0.4
$$evidence(q, q') = \sum_{i=1}^{|E(q) \cap E(q')|} \frac{1}{2^i}$$
$$sim^{k}_{evidence}(q, q') = evidence(q, q') \cdot sim_k(q, q')$$
Evidence-based Simrank
[Figure: two click graphs. pc and camera share one ad (hp.com); camera and digital camera share two ads (hp.com, bestbuy.com).]
C = 0.8
Evidence-based scores:
iteration  (camera, digital camera)  (pc, camera)
1          0.3                       0.4
2          0.42                      0.4
3          0.468                     0.4
4          0.4872                    0.4
5          0.49488                   0.4
6          0.497952                  0.4
$$evidence(q, q') = \sum_{i=1}^{|E(q) \cap E(q')|} \frac{1}{2^i}$$
$$sim^{k}_{evidence}(q, q') = evidence(q, q') \cdot sim_k(q, q')$$
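The evidence factor is a short geometric sum, so the evidence-based scores can be checked by hand or with a few lines of Python. A minimal sketch; the function names are hypothetical and the Simrank inputs are the plain Simrank scores for C = 0.8.

```python
def evidence(common_ads):
    """Sum of 1/2**i for i = 1..common_ads, which equals 1 - 2**(-common_ads)."""
    return sum(1.0 / 2 ** i for i in range(1, common_ads + 1))


def evidence_simrank(sim, common_ads):
    """Scale a plain Simrank score by the evidence of the query pair."""
    return evidence(common_ads) * sim


# pc / camera share one ad; plain Simrank score 0.8.
print(evidence_simrank(0.8, 1))  # 0.4
# camera / digital camera share two ads; plain Simrank score 0.56 at iteration 2.
print(round(evidence_simrank(0.56, 2), 4))  # 0.42
```

With one common ad the factor is 0.5 and with two it is 0.75, so the pair with more shared ads ends up with the higher score after a few iterations.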
Outline
• Sponsored Search Model
• Motivating Example
• Query Rewriting using the click graph
• Simrank
• Evidence-based Simrank
• Weighted Simrank
• Evaluation
Weighted Simrank
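[Figure: two click graphs. First graph: flower and orchids each connect to teleflora.com with weight 1000. Second graph: flower connects to teleflora.com with weight 1000, orchids with weight 1.]
Variance on weights matters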
Weighted Simrank
[Figure: two click graphs. First graph: flower and orchids each connect to teleflora.com with weight 1000. Second graph: flower and orchids each connect to teleflora.com with weight 1.]
Absolute value of weights matters
Weighted Simrank
$$p(a, i) = spread(i) \cdot normalized\_weight(a, i), \quad \forall i \in E(a)$$
$$p(a, a) = 1 - \sum_{i \in E(a)} p(a, i)$$
$$spread(i) = \frac{1}{variance(i)}$$
$$normalized\_weight(a, i) = \frac{w(a, i)}{\sum_{j \in E(a)} w(a, j)}$$
[Figure: sample click graph with queries pc, camera, digital camera, tv, flower and ads hp.com, bestbuy.com, teleflora.com, orchids.com]
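A sketch of how the transition probabilities could be computed from edge weights. Several choices here are assumptions for illustration only: the input format, the use of population variance over the weights arriving at a node, and the spread function, which is passed in as a parameter. The example uses exp(-variance) as a stand-in that stays finite when all weights on a node are equal.

```python
import math
from statistics import pvariance


def transition_probabilities(weights, spread):
    """weights: {(a, i): w(a, i)} for steps a -> i. Returns p(a, i) and the self-loops p(a, a)."""
    out_total, incoming = {}, {}
    for (a, i), w in weights.items():
        out_total[a] = out_total.get(a, 0.0) + w
        incoming.setdefault(i, []).append(w)

    p = {}
    for (a, i), w in weights.items():
        variance = pvariance(incoming[i])     # variance of the weights arriving at i
        normalized_weight = w / out_total[a]  # w(a, i) / sum over j of w(a, j)
        p[(a, i)] = spread(variance) * normalized_weight
    for a in out_total:
        # The remaining probability mass stays on a itself.
        p[(a, a)] = 1.0 - sum(v for (x, i), v in p.items() if x == a and i != a)
    return p


def exp_spread(variance):
    # Assumed stand-in for spread(i); not taken from the slides.
    return math.exp(-variance)


equal = {("flower", "teleflora.com"): 1000, ("orchids", "teleflora.com"): 1000}
skewed = {("flower", "teleflora.com"): 1000, ("orchids", "teleflora.com"): 1}
p_equal = transition_probabilities(equal, exp_spread)
p_skewed = transition_probabilities(skewed, exp_spread)
print(p_equal[("flower", "teleflora.com")], p_skewed[("flower", "teleflora.com")])
```

With equal weights the walk always moves from flower to teleflora.com; with very uneven weights nearly all probability stays on the self-loop, which is the "variance on weights matters" effect.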
Simrank++
• Input: transition matrix P’, evidence matrix V, decay factor C, number of iterations k
• Output: similarity matrix S’
• For i = 1:k, do
– temp = C · P’^T · S’ · P’
– S’ = temp + I – Diag(diag(temp))
• End
• S’ = V .* S’ (element-wise product)
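The full procedure can be sketched with NumPy as the plain matrix loop followed by an element-wise product with V. This is a sketch under stated assumptions: the helper `evidence_matrix`, the choice to leave the diagonal of V at 1, and the use of a row-normalized unweighted matrix as a stand-in for P’ in the example are all hypothetical.

```python
import numpy as np


def evidence_matrix(A):
    """V[x, y] = sum_{i=1..n} 2**-i = 1 - 2**-n, with n the number of common neighbors."""
    common = A @ A  # A is a symmetric 0/1 adjacency matrix
    V = 1.0 - 0.5 ** common
    np.fill_diagonal(V, 1.0)  # assumption: self-similarity is left at 1
    return V


def simrank_pp(P, V, C=0.8, k=5):
    """Matrix Simrank loop on P, then the element-wise product with V."""
    n = P.shape[0]
    S = np.eye(n)
    for _ in range(k):
        temp = C * P.T @ S @ P
        S = temp + np.eye(n) - np.diag(np.diag(temp))
    return V * S  # NumPy's * is element-wise, matching V .* S'


# Hypothetical graph: pc and camera both connect to hp.com (node order: pc, camera, hp.com).
A = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
P = A / A.sum(axis=1, keepdims=True)  # unweighted stand-in for P'
S = simrank_pp(P, evidence_matrix(A), C=0.8, k=5)
print(round(S[0, 1], 4))  # 0.1
```

Because V is applied once at the end, the iteration itself costs the same as plain Simrank.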
Outline
• Sponsored Search Model
• Motivating Example
• Query Rewriting using the click graph
• Simrank
• Evidence-based Simrank
• Weighted Simrank
• Evaluation
Evaluation
• Dataset:
– 2 weeks Yahoo! click graph, 15 million queries, 14
million ads, 28 million edges
– Extracted largest connected component and
further decomposed it into 5 subgraphs (details in
the paper)
– Edge weights: adjusted clicks over impressions
rate (to account for position bias)
• Evaluation set:
– 120 queries sampled from search engine traffic
Evaluation
• Comparison with:
– Pearson similarity
$$sim(q, q') = \frac{\sum_{a \in E(q) \cap E(q')} (w(q, a) - \bar{w}_q)(w(q', a) - \bar{w}_{q'})}{\sqrt{\sum_{a \in E(q) \cap E(q')} (w(q, a) - \bar{w}_q)^2} \, \sqrt{\sum_{a \in E(q) \cap E(q')} (w(q', a) - \bar{w}_{q'})^2}}$$
– Jaccard similarity
$$sim(q, q') = \frac{|E(q) \cap E(q')|}{|E(q) \cup E(q')|}$$
– cosine similarity
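The three baselines can be sketched in plain Python over per-query dictionaries of ad weights. Assumptions: the input format, taking each query's mean weight over all of its ads, and the standard cosine formula (the slide names cosine similarity without giving a formula).

```python
import math


def jaccard(ads_q, ads_q2):
    """|E(q) & E(q')| / |E(q) | E(q')| on sets of ads."""
    union = ads_q | ads_q2
    return len(ads_q & ads_q2) / len(union) if union else 0.0


def pearson(w_q, w_q2):
    """w_q, w_q2: {ad: weight}. The sums run over the ads common to both queries."""
    common = w_q.keys() & w_q2.keys()
    if not common:
        return 0.0
    mean_q = sum(w_q.values()) / len(w_q)  # assumption: mean over all ads of q
    mean_q2 = sum(w_q2.values()) / len(w_q2)
    num = sum((w_q[a] - mean_q) * (w_q2[a] - mean_q2) for a in common)
    den_q = math.sqrt(sum((w_q[a] - mean_q) ** 2 for a in common))
    den_q2 = math.sqrt(sum((w_q2[a] - mean_q2) ** 2 for a in common))
    return num / (den_q * den_q2) if den_q and den_q2 else 0.0


def cosine(w_q, w_q2):
    """Standard cosine similarity between the two weight vectors."""
    dot = sum(w_q[a] * w_q2[a] for a in w_q.keys() & w_q2.keys())
    norm_q = math.sqrt(sum(v * v for v in w_q.values()))
    norm_q2 = math.sqrt(sum(v * v for v in w_q2.values()))
    return dot / (norm_q * norm_q2) if norm_q and norm_q2 else 0.0


print(jaccard({"hp.com"}, {"hp.com", "bestbuy.com"}))  # 0.5
```

All three only look at ads the two queries have in common, so they score 0 for query pairs with no shared ad, whereas Simrank can propagate similarity through longer paths in the graph.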
Metrics
– Precision/recall (manual evaluation)
• Precision(q) = relevant rewrites of q / number of
rewrites for q (among all methods)
• Recall(q) = relevant rewrites of q / number of relevant
rewrites for q (among all methods)
– Query coverage
• Number of queries for which the method gives at least
one rewrite
– Query rewriting depth
• Total number of rewrites for a given query
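A small sketch of the four metrics on sets of rewrites. This is one reading of the definitions: precision is taken per method, and recall is measured against the pool of relevant rewrites found by all methods. The function names and the example data are hypothetical.

```python
def precision(method_rewrites, relevant):
    """Share of a method's rewrites for q that were judged relevant."""
    if not method_rewrites:
        return 0.0
    return len(method_rewrites & relevant) / len(method_rewrites)


def recall(method_rewrites, relevant_pool):
    """Share of the pooled relevant rewrites for q that the method found."""
    if not relevant_pool:
        return 0.0
    return len(method_rewrites & relevant_pool) / len(relevant_pool)


def coverage(rewrites_by_query):
    """Number of queries with at least one rewrite."""
    return sum(1 for rewrites in rewrites_by_query.values() if rewrites)


def depth(rewrites_by_query, q):
    """Total number of rewrites produced for query q."""
    return len(rewrites_by_query.get(q, ()))


# Hypothetical output of one method and hypothetical manual judgments.
rewrites = {"camera": {"digital camera", "pc", "tv"}, "flower": set()}
relevant_pool = {"digital camera", "pc", "camcorder", "webcam"}
print(recall(rewrites["camera"], relevant_pool))  # 0.5
print(coverage(rewrites), depth(rewrites, "camera"))  # 1 3
```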
Conclusions/Open issues
• Proposed use of Simrank for query rewriting
• Two extensions: evidence-based, weighted
• Simrank++ overall best method
• Ad Selection models
• Blend with semantic text-similarity methods
• Incremental computation of Simrank++ values
• Applications to recommendation systems
Thank You!
http://infoblog.stanford.edu