Ranking Deep Web Sources and Online Ads

Trust and Profit Sensitive Ranking for Web
Databases and On-line Advertisements
Raju Balakrishnan (Arizona State University)
Agenda
• Trust and Relevance based Ranking of Web Databases for the Deep Web.
• Ad-Ranking Considering Mutual Influences.
Optimal Ad Ranking for Profit Maximization
Deep Web Integration Problem
[Figure: a mediator over the deep web, which consists of millions of web databases (Web DB) containing structured tuples; an uncontrolled collection of redundant information.]
Source Selection in Deep Web
Given a user query, select a subset of sources to provide the most relevant and trustworthy answers.
Trustworthiness: Degree of belief in the correctness of the data.
Relevance: Degree to which the data satisfies the information needs of the user.
Search results must be trustworthy and relevant.
Surface web search combines hyperlink-based PageRank and relevance to assure trust and relevance of results.
Source Agreement
Observations
• Many sources return answers to the same query.
• Comparison of the semantics of the answers is facilitated by the structure of the tuples.
Idea: Compare Agreement of Answers Returned by
Different Sources to Assess the Reputation of Sources!
Agreement-based relevance and trust assessment may be intuitively understood as a meta-reviewer assessing the quality of a paper based on agreement between primary reviews. Reviewers agreed upon by other reviewers are likely to be relevant and trustworthy.
Agreement Implies Trust & Relevance
Probability of agreement of two independently selected irrelevant/false tuples is
\[ P_a(r_1, r_2) = \frac{1}{|U|} \]
Probability of agreement of two independently picked relevant and true tuples is
\[ P_a(g_1, g_2) = \frac{1}{|RT|} \]
\[ |U| \gg |RT| \Rightarrow P_a(g_1, g_2) \gg P_a(r_1, r_2) \]
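A quick numeric check of this argument, as a sketch only: assuming tuples are drawn uniformly and independently from a pool, two draws coincide with probability one over the pool size. The pool sizes and the function name `agreement_probability` are made up for illustration.

```python
from fractions import Fraction


def agreement_probability(pool_size: int) -> Fraction:
    """Exact probability that two independent uniform draws from a pool coincide."""
    agreeing = sum(1 for a in range(pool_size) for b in range(pool_size) if a == b)
    return Fraction(agreeing, pool_size * pool_size)


# Hypothetical sizes: |U| = 200 candidate tuples, |RT| = 10 relevant and true tuples.
p_irrelevant = agreement_probability(200)  # P_a(r1, r2) = 1/|U|
p_relevant = agreement_probability(10)     # P_a(g1, g2) = 1/|RT|
print(p_irrelevant, p_relevant, p_relevant / p_irrelevant)
```

The smaller the pool of relevant and true tuples is compared to the universe, the more strongly agreement points to relevance and truth.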
Computing Agreement between Sources
Closely Related to Record Linkage Problem for
Integration of databases without common domains
(Cohen 98).
We used greedy matching between tuples using Jaro-Winkler similarity with SoftTF-IDF, since this measure performs best for named entity matching (Cohen et al. 03).
Agreement is computed using the top-5 answer tuples to sample queries (200 queries for each domain).
• The computational complexity is \(O(V^2 k^2)\), where V is the number of data sources, using top-k answers.
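The greedy matching step can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: `difflib.SequenceMatcher` stands in for Jaro-Winkler with SoftTF-IDF, tuples are treated as plain strings, and taking the agreement to be the summed similarity of matched pairs is an assumption. The names `similarity` and `greedy_agreement` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in measure; the slides use Jaro-Winkler with SoftTF-IDF instead.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def greedy_agreement(r1: list[str], r2: list[str]) -> float:
    """Greedily pair tuples of r1 and r2 by descending similarity.

    Each tuple is matched at most once; the result is the summed
    similarity of the matched pairs (an assumed definition of agreement).
    """
    candidates = sorted(
        ((similarity(t1, t2), i, j) for i, t1 in enumerate(r1) for j, t2 in enumerate(r2)),
        reverse=True,
    )
    used1, used2, total = set(), set(), 0.0
    for score, i, j in candidates:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            total += score
    return total


# Hypothetical top-k answers from two movie sources for the same query.
r1 = ["The Godfather (1972)", "Casablanca"]
r2 = ["Godfather, The", "Casablanca (1942)"]
print(greedy_agreement(r1, r2))
```

Each pair of sources needs k x k tuple comparisons, and there are on the order of V x V source pairs, which is where the quadratic terms in the complexity come from.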
Representation: Agreement Graph
\[ W(S_1 \rightarrow S_2) = \beta + (1 - \beta) \times \frac{A(R_1, R_2)}{|R_2|} \]
where \(\beta\) induces the smoothing links to account for the unseen samples. \(R_1, R_2\) are the result sets of \(S_1, S_2\).
Link semantics from Si to Sj with weight w: Si acknowledges w fraction of tuples in Sj.
[Figure: sample agreement graph for the book sources, with nodes S1, S2, S3 and edge weights 0.78, 0.86, 0.22, 0.4, 0.14, 0.6.]
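As a worked example of the edge weight, the sketch below computes one link weight from an agreement value and the size of the target result set. The smoothing value of 0.1, the sample inputs, and the function name `edge_weight` are illustrative assumptions; the slides do not state the smoothing value used.

```python
def edge_weight(agreement: float, r2_size: int, smoothing: float = 0.1) -> float:
    """Weight of the link S1 -> S2: smoothing + (1 - smoothing) * A(R1, R2) / |R2|."""
    if r2_size <= 0:
        raise ValueError("the result set R2 must contain at least one tuple")
    return smoothing + (1.0 - smoothing) * agreement / r2_size


# Hypothetical values: agreement of 3.2 against a result set of 5 tuples.
print(edge_weight(3.2, 5))
```

With zero agreement the weight falls back to the smoothing value alone, so every pair of sources stays connected by a weak link.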
Calculating SourceRank
How does a searcher use the agreement graph?
1. Start on a random node.
2. If the searcher likes the result, randomly traverse a link, with probability proportional to its weight, to search an agreed database.
3. If the searcher does not like the node, restart the search by traversing a smoothing link.
This is a Weighted Markov Random Walk.
The visit probability of the searcher for a database is given by the
stationary visit probability of the random walk on the database vertex.
SourceRank is equal to this stationary visit probability of
the random walk on the database vertex.
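The stationary visit probability can be approximated by power iteration, sketched below. This is a minimal illustration, not the system's code: the three-node graph and its weights are hypothetical, the weights are assumed to already include the smoothing links, and `source_rank` is a made-up function name.

```python
def source_rank(weights: dict[str, dict[str, float]], iterations: int = 100) -> dict[str, float]:
    """Stationary visit probabilities of a weighted random walk, by power iteration."""
    nodes = sorted(weights)
    # Normalize outgoing weights so each node's links form a probability distribution.
    transition = {}
    for s in nodes:
        total = sum(weights[s].values())
        transition[s] = {t: w / total for t, w in weights[s].items()}
    rank = {s: 1.0 / len(nodes) for s in nodes}  # start from a uniform distribution
    for _ in range(iterations):
        nxt = {s: 0.0 for s in nodes}
        for s in nodes:
            for t, p in transition[s].items():
                nxt[t] += rank[s] * p
        rank = nxt
    return rank


# Hypothetical agreement graph over three sources.
graph = {
    "S1": {"S2": 0.6, "S3": 0.4},
    "S2": {"S1": 0.2, "S3": 0.8},
    "S3": {"S1": 0.5, "S2": 0.5},
}
ranks = source_rank(graph)
print(sorted(ranks.items(), key=lambda item: item[1], reverse=True))
```

A source that many other sources link to with high weight collects more of the walk's visit probability and therefore ranks higher.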
Combining Coverage and SourceRank
Coverage of a set of tuples T w.r.t. a query q:
\[ C(T \mid q) = \sum_{t \in T} R(t \mid q) \]
Coverage is calculated using sample queries, and we
used Jaro-Winkler with SoftTF-IDF similarity between
the query and the tuple as the relevance measure.
We combine the Coverage and SourceRank as
\[ \text{Score} = \alpha \times \text{Coverage} + (1 - \alpha) \times \text{SourceRank} \]
Databases are ranked based on this Score, with \(\alpha = 0.5\).
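A small sketch of the combination, assuming coverage and SourceRank have already been brought to comparable scales (the slides do not say how, or whether, they are normalized). The source names, scores, and function names are hypothetical.

```python
def coverage(relevances: list[float]) -> float:
    """Coverage of a set of tuples: the sum of per-tuple relevance to the query."""
    return sum(relevances)


def combined_score(cov: float, rank: float, weight: float = 0.5) -> float:
    """Linear combination: weight * Coverage + (1 - weight) * SourceRank."""
    return weight * cov + (1.0 - weight) * rank


# Hypothetical (coverage, SourceRank) pairs, both assumed to lie in [0, 1].
sources = {"db_a": (0.8, 0.2), "db_b": (0.5, 0.7), "db_c": (0.3, 0.4)}
ranking = sorted(sources, key=lambda name: combined_score(*sources[name]), reverse=True)
print(ranking)
```

With equal weighting, a source with moderate coverage but strong agreement from other sources can outrank one that only covers the query well.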
Evaluations and Results
Evaluated on web databases in the movies and books domains listed in the UIUC TEL-8 repository, twenty-two from each domain.
Evaluation Metrics
1. Ability to remove closely related out-of-domain sources.
2. Top-5 precision (relevance evaluation).
3. Ability to remove corrupted sources (trustworthiness).
4. Time to compute the agreement graph.
1. Ranks of Out of Domain Sources
[Figure: rank of the source (axis from 0 to 20) for Source 1 through Source 5, with the non-topical source marked, under Coverage, Combined, and SourceRank.]
2. Top-5 Precision-Movies
[Figure: two panels, "Movies Top-4 Source Selection" and "Movies Top-8 Source Selection"; y-axis: precision. Annotated values: 36% and 40%.]
2. Top-5 Precision-Books
[Figure: two panels, "Top-4 Source Selection" and "Top-8 Source Selection"; y-axis: precision; series: Coverage, SourceRank, Combination.]
3. Trustworthiness of Source Selection
[Figure: two panels, "Trustworthiness-Movies" and "Trustworthiness-Books"; x-axis: corruption level (0 to 0.5); y-axis: percentage of decrease in position; series: SourceRank, Coverage.]
4. Time to Compute Agreement Graph
[Figure: two panels, "Time vs. number of sources" (x-axis: number of sources) and "Time vs. top-k tuples" (x-axis: k, top-k tuples); y-axis: time in seconds; series: Books, Movies.]
System Implementation
Searches online books and movies web databases
http://rakaposhi.eas.asu.edu/scuba
System Architecture
• Implemented as a web application.
• Searches real web databases.
Agenda
• Trust and Relevance based Ranking of Web Databases for the Deep Web.
• Ad-Ranking Considering Mutual Influences.
Ad Ranking: State of the Art
Sort by Bid Amount x Relevance
Sort by Bid Amount
Ads are Considered in Isolation, Ignoring Mutual influences.
We Consider Ads as a Set, and ranking is
based on User’s Browsing Model
Mutual Influences
Three manifestations of mutual influences on an ad a are:
1. Similar ads placed above a
• Reduces the user's residual relevance of the ad a
2. Relevance of other ads placed above a
• The user may click on an ad above and may not view the ad a
3. Abandonment probability of other ads placed above a
• The user may abandon the search and not view the ad a
User's Browsing Model
• The user browses down, starting at the first ad.
• At every ad a, the user may:
  • click the ad with relevance probability \(R(a) = P(\text{click}(a) \mid \text{view}(a))\)
  • abandon browsing with probability \(\gamma(a)\)
  • go down to the next ad with probability \(1 - (R(a) + \gamma(a))\)
• The process repeats for the ads below with a reduced probability.
If \(a_2\) is similar to \(a_1\), the residual relevance of \(a_2\) goes down and the abandonment probability goes up.
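The browsing model implies a view probability for each position: an ad is viewed only if the user neither clicked nor abandoned at every ad above it. The sketch below computes these probabilities; the ad values and the function name `view_probabilities` are hypothetical, and ad similarities are ignored.

```python
def view_probabilities(ads: list[tuple[float, float]]) -> list[float]:
    """View probability per position, top to bottom.

    Each ad is a (relevance, abandonment) pair. The user moves past an ad
    with probability 1 - (relevance + abandonment).
    """
    probabilities, reach = [], 1.0
    for relevance, abandonment in ads:
        probabilities.append(reach)
        reach *= 1.0 - (relevance + abandonment)
    return probabilities


# Hypothetical ads listed top to bottom as (relevance, abandonment).
print(view_probabilities([(0.3, 0.1), (0.2, 0.2), (0.1, 0.1)]))
```

The first ad is always viewed, and each later position inherits the product of the pass-through probabilities of the ads above it.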
Expected Profit Considering Ad Similarities
Considering bid amounts (\(\$(a_i)\)), residual relevance (\(R_r(a_i)\)), abandonment probability (\(\gamma(a_i)\)), and similarities, the expected profit from a set of n ads is
\[ \text{Expected Profit} = \sum_{i=1}^{n} \$(a_i)\, R_r(a_i) \prod_{j=1}^{i-1} \left[ 1 - \left( R_r(a_j) + \gamma(a_j) \right) \right] \]
THEOREM: Optimal ad placement considering similarities between the ads is NP-hard.
The proof is by reduction from the independent set problem to choosing the top k ads considering similarities.
Expected Profit Considering other two Mutual
Influences (2 and 3)
Dropping similarity, hence replacing residual relevance (\(R_r(a_i)\)) by absolute relevance (\(R(a_i)\)),
\[ \text{Expected Profit} = \sum_{i=1}^{n} \$(a_i)\, R(a_i) \prod_{j=1}^{i-1} \left[ 1 - \left( R(a_j) + \gamma(a_j) \right) \right] \]
Ranking to Maximize This Expected Profit is a
Sorting Problem
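The expected profit without similarities can be evaluated in a single pass over the ranked list, as sketched below. The ad values and the function name `expected_profit` are hypothetical; each ad is assumed to be a (bid, relevance, abandonment) triple with relevance plus abandonment at most 1.

```python
def expected_profit(ads: list[tuple[float, float, float]]) -> float:
    """Expected profit of ads ranked top to bottom.

    Each ad is (bid, relevance, abandonment). An ad earns bid * relevance
    weighted by the probability that the user reaches its position.
    """
    profit, reach = 0.0, 1.0
    for bid, relevance, abandonment in ads:
        profit += bid * relevance * reach
        reach *= 1.0 - (relevance + abandonment)
    return profit


# Hypothetical two-ad ranking: 2.0*0.3*1.0 + 5.0*0.2*0.6
print(expected_profit([(2.0, 0.3, 0.1), (5.0, 0.2, 0.2)]))
```

Swapping the two ads changes the total, which is why the order matters even when the set of ads is fixed.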
Optimal Ranking
Rank ads in descending order of:
\[ RF(a) = \frac{\$(a)\, R(a)}{R(a) + \gamma(a)} \]
The physical meaning of RF is the profit generated per unit of consumed view probability of an ad.
Ads placed higher have more view probability, so placing ads that produce more profit per consumed view probability higher is intuitively justifiable. (Refer to Balakrishnan & Kambhampati (WebDB 08) for the proof of optimality.)
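As an illustration of the optimality claim, the sketch below ranks a handful of hypothetical ads by profit per unit of consumed view probability, taken here to be bid x relevance divided by (relevance + abandonment), and compares the result against a brute-force search over all orderings. All names and values are made up, and the check only covers this small example; the proof is in the cited paper.

```python
from itertools import permutations

Ad = tuple[float, float, float]  # (bid, relevance, abandonment)


def expected_profit(ads: list[Ad]) -> float:
    """Expected profit of ads ranked top to bottom, ignoring ad similarities."""
    profit, reach = 0.0, 1.0
    for bid, relevance, abandonment in ads:
        profit += bid * relevance * reach
        reach *= 1.0 - (relevance + abandonment)
    return profit


def ranking_function(ad: Ad) -> float:
    """Profit per unit of view probability consumed by the ad."""
    bid, relevance, abandonment = ad
    return bid * relevance / (relevance + abandonment)


def rank_ads(ads: list[Ad]) -> list[Ad]:
    return sorted(ads, key=ranking_function, reverse=True)


# Hypothetical ads; relevance + abandonment stays within (0, 1].
ads = [(2.0, 0.3, 0.1), (5.0, 0.2, 0.2), (1.0, 0.5, 0.05), (4.0, 0.1, 0.3)]
best_by_search = max(expected_profit(list(order)) for order in permutations(ads))
print(expected_profit(rank_ads(ads)), best_by_search)
```

Sorting takes O(n log n), while the brute-force search grows factorially, so the closed-form ranking is what makes the optimum practical.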
Comparison to Yahoo and Google
Yahoo!
• Assume abandonment probability is zero: \(\gamma(a) = 0\)
\[ RF(a) = \frac{\$(a)\, R(a)}{R(a)} = \$(a) \]
Assumes that the user has infinite patience to go down the results until he finds the ad he wants.
Google
• Assume \(\gamma(a) = k - R(a)\), where k is a constant for all ads
\[ RF(a) = \frac{\$(a)\, R(a)}{k} \propto \$(a)\, R(a) \]
Assumes that abandonment probability decreases linearly as relevance increases.
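The two special cases can be checked numerically. The sketch below assumes the ranking function is bid x relevance divided by (relevance + abandonment) and uses hypothetical ads and a hypothetical constant k = 0.6; it confirms that zero abandonment reproduces sorting by bid, and that abandonment of k minus relevance reproduces sorting by bid x relevance.

```python
def ranking_function(bid: float, relevance: float, abandonment: float) -> float:
    return bid * relevance / (relevance + abandonment)


# Hypothetical ads as (bid, relevance); k must be at least the largest relevance.
ads = [(2.0, 0.3), (5.0, 0.2), (1.0, 0.5), (4.0, 0.1)]
k = 0.6

# Case 1: abandonment probability is zero for every ad.
by_rf_zero = sorted(ads, key=lambda ad: ranking_function(ad[0], ad[1], 0.0), reverse=True)
by_bid = sorted(ads, key=lambda ad: ad[0], reverse=True)

# Case 2: abandonment probability is k - relevance.
by_rf_k = sorted(ads, key=lambda ad: ranking_function(ad[0], ad[1], k - ad[1]), reverse=True)
by_bid_relevance = sorted(ads, key=lambda ad: ad[0] * ad[1], reverse=True)

print(by_rf_zero == by_bid, by_rf_k == by_bid_relevance)
```

Both existing strategies therefore appear as particular assumptions about abandonment within the same ranking function.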
Quantifying Expected Profit
[Figure: expected profit (y-axis, 0 to 40) against an x-axis running from 0 to 1, for three strategies: RF, Bid Amount x Relevance, and Bid Amount. Annotated differences: 35.9% and 45.7%.]
Simulation settings:
• Abandonment probability: uniform random, with lower bound 0
• Relevance: uniform random, with upper bound 1
• Number of clicks: Zipf random with exponent 1.5
• Bid amounts: uniform random, \(0 \le \$(a) \le 10\)
Observations:
• The Bid Amount Only strategy becomes optimal at \(\gamma(a) = 0\).
• The difference in profit between RF and the competing strategy is significant.
• The proposed strategy gives maximum profit for the entire range.
Contributions
SourceRank
• Agreement-based computation of relevance and trust of deep web sources.
• System implementation to search the deep web, and formal evaluation.
Ad-Ranking
• Extending the expected profit model of ads based on the browsing model, considering mutual influences.
• Optimal ad ranking considering mutual influences other than ad similarities.
Thank You!
Deep Web Integration Roadmap