Adaptive Information Integration
Subbarao Kambhampati
http://rakaposhi.eas.asu.edu/i3
Thanks to Zaiqing Nie, Ullas Nambiar & Thomas Hernandez
Talk at USC/Information Sciences Institute; November 5th 2004.
Yochan Research Group
Plan-Yochan
• Automated Planning
  – Temporal planning
    • Multi-objective optimization
    • Partial satisfaction planning
  – Conditional/Conformant/Stochastic planning
    • Heuristics using labeled planning graphs
  – OR approaches to planning
  – Applications to
    • Autonomic computing
    • Web service composition
    • Workflows
Db-Yochan
• Information Integration
– Adaptive Information Integration
• Learning source profiles
• Learning user interests
– Applications to
• Bio-informatics
• Anthropological sources
– Service and Sensor Integration
[Architecture diagram: a mediator integrates webpages, structured data, services, and sensors (streaming data). A source catalog holds ontologies and statistics; learned statistics are gathered via probing queries. Our focus is query processing: a multi-objective, anytime query planner that handles services and sensor streams and produces an annotated plan, which an executor/monitor runs to return answers.]
Adaptive Information Integration
• Query processing in information integration needs to
be adaptive to:
– Source characteristics
• How is the data spread among the sources?
– User needs
• Multi-objective queries (tradeoff coverage for cost)
• Imprecise queries
• To be adaptive, we need profiles (meta-data) about
sources as well as users
– Challenge: Profiles are not going to be provided..
• Autonomous sources may not export meta-data about data spread!
• Lay users may not be able to articulate the source of their imprecision!
Need approaches that gather (learn) the meta-data they need
Three contributions to
Adaptive Information Integration
• BibFinder/StatMiner
– Learns and uses source coverage and overlap statistics to
support multi-objective query processing
• [VLDB 2003; ICDE 2004; TKDE 2005]
• COSCO
– Adapts the Coverage/Overlap statistics to text collection
selection
• Imprecise query answering
  – Supports imprecise queries by automatically learning
    approximate structural relations among data tuples
• [WebDB 2004; WWW 2004]
Although we focus on avoiding retrieval of duplicates,
Coverage/Overlap statistics can also be used to look for duplicates
Adaptive Integration of Heterogeneous Power Point Slides
[Composite slide: the mediator architecture diagram overlaid with the title pages of the two embedded talks:
  "Improving Text Collection Selection using Coverage and Overlap Statistics" – MS Thesis Defense by Thomas Hernandez, Arizona State University, 10/21/2004
  "Mining Approximate Functional Dependencies & Concept Similarities to Answer Imprecise Queries" – Ullas Nambiar, Subbarao Kambhampati, Dept of CS & Engg, Arizona State University, http://rakaposhi.eas.asu.edu/i3/; WebDB, June 17-18 2004, Paris, France]
Different template "schemas", different font styles: naïve "concatenation" approaches don't work!
Part I: BibFinder
• BibFinder: A popular CS bibliographic mediator
– Integrating 8 online sources: DBLP, ACM DL, ACM Guide, IEEE Xplore,
ScienceDirect, Network Bibliography, CSB, CiteSeer
– More than 58000 real user queries collected
• Mediated schema relation in BibFinder:
paper(title, author, conference/journal, year)
Primary key: title+author+year
• Focus on Selection queries
Q(title, author, year) :- paper(title, author, conference/journal, year),
conference=SIGMOD
Residual coverage, e.g. of DBLP after CSB has been called:
P(DBLP ^ ¬CSB | Q) = P(DBLP | Q) − P(CSB ^ DBLP | Q)
Background & Motivation
• Sources are incomplete and partially overlapping
• Calling every possible source is inefficient and impolite
• Need coverage and overlap statistics to figure out what
sources are most relevant for every possible query!
• We introduce a frequency-based approach for mining
these statistics
Coverage: probability that a random answer
tuple for query Q belongs to source S.
Noted as P(S|Q).
Overlap: Degree to which sources contain
the same answer tuples for query Q.
Noted as P(S1 ^ S2 ^ … ^ Sk |Q).
[Venn diagram: overlapping coverage of DBLP, ACM DL, and CSB for a query]
Challenges
• Challenges of gathering coverage and overlap statistics
– It’s impractical to assume that the sources will export such statistics, because
the sources are autonomous.
– It’s impractical to learn and store all the statistics for every query.
• Necessitates N_Q × 2^(N_S) different statistics, where N_Q is the number of possible
queries and N_S is the number of sources
• Impractical to assume knowledge of the entire query population a priori
• We introduce StatMiner
– A threshold-based hierarchical mining approach
– Stores statistics w.r.t. query classes
– Keeps more accurate statistics for more frequently asked queries
– Handles the efficiency/accuracy tradeoff by adjusting the thresholds
BibFinder/StatMiner
[Architecture diagram: BibFinder issues source calls to CSB, DBLP, ACM DL, Netbib, ScienceDirect, and CiteSeer and collects answer tuples for user queries. StatMiner takes the query list and statistics maintained by the mediator and (1) learns AV hierarchies, (2) discovers frequent query classes, and (3) learns coverage and overlap statistics for those classes.]
Query List & Raw Statistics
Query List: the mediator maintains an XML log of all user queries, along with
their access frequency, the number of total distinct answers obtained, and the
number of answers from each source set that has answers for the query.
Given the query list, we can compute the raw statistics for each query: P(S1 ^ … ^ Sk | q).

Example log entries (Query; Frequency; Distinct Answers; Overlap (Coverage) per source set):

Query: Author="andy king"; Frequency: 106; Distinct Answers: 46
  DBLP 35; CSB 23; CSB,DBLP 12; DBLP,Science 3; Science 3; CSB,DBLP,Science 1

Query: Author="fayyad", Title="data mining"; Frequency: 1; Distinct Answers: 27
  CSB,Science 1; CSB 16; DBLP 16; CSB,DBLP 7; ACMdl 5; ACMdl,CSB 3; ACMdl,DBLP 3; ACMdl,CSB,DBLP 2; Science 1
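To make the computation concrete, here is a minimal Python sketch (not BibFinder's actual code; variable names and structure are illustrative) that turns the Author="andy king" entry above into raw coverage/overlap statistics by dividing each source set's answer count by the total number of distinct answers:

```python
# Raw statistics for one logged query: each source set that returned answers,
# with the number of distinct answers found in all sources of that set.
query = 'Author="andy king"'
total_distinct_answers = 46
answers_by_source_set = {
    ("DBLP",): 35,
    ("CSB",): 23,
    ("CSB", "DBLP"): 12,
    ("DBLP", "Science"): 3,
    ("Science",): 3,
    ("CSB", "DBLP", "Science"): 1,
}

# P(S_hat | Q): probability that a random answer tuple for Q belongs to every
# source in S_hat (coverage for singleton sets, overlap for larger sets).
raw_stats = {
    source_set: count / total_distinct_answers
    for source_set, count in answers_by_source_set.items()
}

for source_set, p in sorted(raw_stats.items(), key=lambda kv: -kv[1]):
    print(f"P({' ^ '.join(source_set)} | {query}) = {p:.3f}")
```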
AV Hierarchies and Query Classes
Attribute-Value Hierarchy:
An AV Hierarchy is a classification of
the values of a particular attribute of
the mediator relation. Leaf nodes in
the hierarchy correspond to concrete
values bound in a query.
Query Class: queries are grouped into
classes by computing cartesian
products over the AV Hierarchies.
A query class is a set of queries that
all share a set of assignments of
particular attributes to specific values.
[Figure: AV hierarchy for the Year attribute (root RT with leaves 2001, 2002); AV hierarchy for the Conference attribute (root RT with children DB {SIGMOD, ICDE} and AI {AAAI, ECP}); and the query class hierarchy obtained from their cartesian product, e.g. RT,RT; DB,RT; AI,RT; RT,01; RT,02; SIGMOD,RT; ICDE,RT; AAAI,RT; ECP,RT; DB,01; DB,02; AI,01; SIGMOD01; ICDE01; ICDE02; AAAI01; ECP01.]
StatMiner
Learning AV Hierarchies
Attribute values are extracted from the
query list.
Clustering similar attribute values leads
to finding similar selection queries based
on the similarity of their answer
distributions over the sources.
d(Q1, Q2) = sqrt( Σi [ P(Ŝi | Q1) − P(Ŝi | Q2) ]^2 )
The AV Hierarchies are generated using
an agglomerative hierarchical clustering
algorithm.
They are then flattened according to
their tightness.
tightness(C) = 1 / ( Σ_{Q ∈ C} (P(Q) / P(C)) · d(Q, C) )
[Figure: clusters C1 and C2 of attribute values (A1, A2, A3) within a query class QC are merged when D(C1, C2) <= 1/tightness(C1), yielding a flattened AV hierarchy.]
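A minimal sketch of the clustering idea, assuming each attribute value is summarized by a (hypothetical) distribution of its answers over the sources; the actual StatMiner procedure also weighs queries by frequency and flattens the resulting hierarchy using the tightness measure above:

```python
import math

# Each attribute value (e.g., a conference name) is summarized by the
# distribution of its query answers over the sources (hypothetical numbers).
value_distributions = {
    "SIGMOD": {"DBLP": 0.60, "CSB": 0.30, "ACMDL": 0.10},
    "ICDE":   {"DBLP": 0.55, "CSB": 0.35, "ACMDL": 0.10},
    "AAAI":   {"DBLP": 0.20, "CSB": 0.70, "ACMDL": 0.10},
    "ECP":    {"DBLP": 0.15, "CSB": 0.75, "ACMDL": 0.10},
}
SOURCES = ["DBLP", "CSB", "ACMDL"]

def distance(d1, d2):
    # d(Q1, Q2): Euclidean distance between answer distributions over sources.
    return math.sqrt(sum((d1.get(s, 0.0) - d2.get(s, 0.0)) ** 2 for s in SOURCES))

def average(dists):
    # Distribution of a cluster = unweighted average of its members' distributions.
    return {s: sum(d.get(s, 0.0) for d in dists) / len(dists) for s in SOURCES}

# Plain agglomerative clustering: repeatedly merge the two closest clusters
# until the closest pair is farther apart than a merge threshold.
clusters = [([v], dist) for v, dist in value_distributions.items()]
MERGE_THRESHOLD = 0.2
while len(clusters) > 1:
    (i, j), best = min(
        (((a, b), distance(clusters[a][1], clusters[b][1]))
         for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
        key=lambda x: x[1])
    if best > MERGE_THRESHOLD:
        break
    merged = (clusters[i][0] + clusters[j][0], average([clusters[i][1], clusters[j][1]]))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for members, _ in clusters:
    print("cluster:", members)   # e.g. {SIGMOD, ICDE} and {AAAI, ECP}
```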
Discovering Frequent
Query Classes
Candidate frequent query
classes are identified using the
anti-monotone property.
Classes which are infrequently
mapped are then removed.
Learning Coverage and Overlap
Coverage and overlap statistics
are computed for each frequent
query class using a modified
Apriori algorithm.
P(Ŝ | C) = Σ_{Q ∈ C} P(Ŝ | Q) · P(Q) / P(C)
where the P(Ŝ | Q) are the raw statistics.
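A small sketch of this aggregation step with hypothetical numbers: class statistics are the frequency-weighted average of the raw per-query statistics of the queries mapped to the class.

```python
# Raw per-query statistics P(S_hat | Q) and query probabilities P(Q) for the
# queries mapped to one frequent class C (hypothetical numbers).
queries_in_class = [
    # (P(Q), {source_set: P(S_hat | Q)})
    (0.010, {("DBLP",): 0.76, ("CSB",): 0.50, ("CSB", "DBLP"): 0.26}),
    (0.002, {("DBLP",): 0.59, ("CSB",): 0.63, ("CSB", "DBLP"): 0.26}),
]

# P(C) = sum of the probabilities of the member queries.
p_class = sum(p_q for p_q, _ in queries_in_class)

# P(S_hat | C) = sum over Q in C of P(S_hat | Q) * P(Q) / P(C)
class_stats = {}
for p_q, raw in queries_in_class:
    for source_set, p in raw.items():
        class_stats[source_set] = class_stats.get(source_set, 0.0) + p * p_q / p_class

for source_set, p in class_stats.items():
    print(f"P({' ^ '.join(source_set)} | C) = {p:.3f}")
```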
Using Coverage and Overlap Statistics to
Rank Sources
Residual coverage, e.g. of DBLP given that CSB has already been called:
P(DBLP ^ ¬CSB | Q) = P(DBLP | Q) − P(CSB ^ DBLP | Q)
1. A new user query is mapped to a set of least
general query classes.
2. The mediator estimates the statistics for the
query using a weighted sum of the statistics of
the mapped classes.
3. Data sources are ranked and called in order of
relevance using the estimated statistics.
In particular:
- The most relevant source has highest
coverage
- The next best source has highest residual
coverage
As a result, the maximum number of tuples is
obtained while the fewest sources are called.
[Venn diagram: overlapping coverage of DBLP, ACMDL, and CSB]
Example: Here, CSB has the highest coverage,
followed by DBLP. However, since ACMDL has
higher residual coverage than DBLP, the top 2
sources called would be CSB and ACMDL.
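The example can be reproduced with a short sketch (hypothetical coverage and overlap numbers): pick the source with the highest coverage first, then the source with the highest residual coverage given the first.

```python
# Coverage P(S|Q) and pairwise overlap P(S1 ^ S2 | Q) for one query
# (hypothetical numbers echoing the Venn-diagram example).
coverage = {"CSB": 0.55, "DBLP": 0.45, "ACMDL": 0.40}
pairwise_overlap = {
    frozenset(("CSB", "DBLP")): 0.35,
    frozenset(("CSB", "ACMDL")): 0.10,
    frozenset(("DBLP", "ACMDL")): 0.15,
}

# Greedy top-2 selection: first the source with the highest coverage, then the
# source with the highest residual coverage given the first one:
#   P(S ^ not(first) | Q) = P(S | Q) - P(S ^ first | Q)
first = max(coverage, key=coverage.get)

def residual(source):
    return coverage[source] - pairwise_overlap[frozenset((source, first))]

second = max((s for s in coverage if s != first), key=residual)

print("call order:", first, "then", second)
# CSB is picked first; ACMDL (residual 0.40 - 0.10 = 0.30) beats
# DBLP (residual 0.45 - 0.35 = 0.10), matching the example above.
```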
BibFinder/StatMiner Evaluation
Purpose of the experiments:
Analysis of space consumption
Estimation of the accuracy of the learned statistics
Evaluation of the effectiveness of those statistics in
BibFinder.
Query planning algorithms used in the experiments:
- Random Select (RS): without any stats.
- Simple Greedy (SG): only coverage stats.
- Greedy Select (GS): coverage and overlap stats.
Precision of a plan: fraction of sources in the
estimated plan which are the actual top sources.
Experimental setup with BibFinder:
• Mediator relation:
Paper(title, author, conference/journal, year)
• 25000 real user queries are used;
among them, 4500 queries are
randomly chosen as test queries.
• AV Hierarchies for all four
attributes are learned automatically.
• 8000 distinct values in author,
1200 frequently asked keyword
itemsets in title, 600 distinct
values in conference/journal,
and 95 distinct values in year.
Learned Conference Hierarchy
Plan Precision
[Chart: plan precision (fraction of true top-K sources called) vs. the minfreq threshold (0.03%-0.73%) for Random Select (RS), Simple Greedy (SG), and Greedy Select (GS) with minoverlap 0 and 0.3.]
• Here we observe the average precision of the top-2 source plans.
• The plans using our learned statistics have high precision compared to Random Select, and precision decreases very slowly as we change the minfreq and minoverlap thresholds.
Number of Distinct Results
[Chart: average number of distinct answers of top-2 source plans vs. the minfreq threshold (0.03%-0.73%) for RS, SG0, GS0, SG0.3, GS0.3.]
• Here we observe the average number of distinct results of the top-2 source plans.
• Our methods get on average about 50 distinct answers, while Random Select gets only about 30.
Plan Precision on Controlled Sources
[Chart: plan precision of top-5 source plans vs. threshold (0%-2.25%) for Greedy Select, Simple Greedy, and Random Select over 25 simulated sources.]
We observe the plan precision of top-5 source plans (25 simulated sources in total). Using Greedy Select does produce better plans; see Sections 3.8 and 3.9 for detailed information.
Towards Multi-Objective Query Optimization
(Or What good is a high coverage source
that is off-line?)
• Sources vary significantly in terms of their response times
  – The response time depends both on the source itself and on the query that is asked of it
    • Specifically, which fields are bound in the selection query can make a difference
• Hard enough to get a high coverage or a low response time plan; but now we have to combine them…
• Challenges:
  1. How do we gather response time statistics?
  2. How do we define an optimal plan in the context of both coverage/overlap and response time requirements?
Response times of BibFinder Tuples
[Charts: response-time distributions of BibFinder tuples; response time can depend on the query type, e.g. range queries on year and the effect of binding the author field.]
Response times can also depend on the time of the day and the day of the week [Raschid et al. 2002].
Multi-objective Query optimization
• Need to optimize queries jointly for both high
coverage and low response time
– Staged optimization won’t quite work.
• An idea: Make the source selection be dependent on
both (residual)coverage and response time
Some possible utility functions we experimented with:
[CIKM, 2001]
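The actual utility functions are not reproduced in this copy of the slides; the sketch below shows one plausible form, a weighted linear combination of residual coverage and normalized response time, purely as an illustration of the idea.

```python
def utility(residual_coverage, response_time_s, alpha=0.7, max_time_s=10.0):
    """Hypothetical utility for picking the next source: trade off the residual
    coverage it adds against its (normalized) response time.
    alpha = 1 ignores response time; alpha = 0 ignores coverage."""
    time_penalty = min(response_time_s, max_time_s) / max_time_s
    return alpha * residual_coverage - (1 - alpha) * time_penalty

# Example: a high-coverage but slow source vs. a lower-coverage fast source.
sources = {"CSB": (0.45, 7.5), "DBLP": (0.35, 1.2)}   # (residual coverage, seconds)
ranked = sorted(sources, key=lambda s: utility(*sources[s]), reverse=True)
print(ranked)  # with alpha=0.7, DBLP's speed outweighs CSB's extra coverage
```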
Results on BibFinder
Part II: Text Collection Selection
with COSCO
Selecting among overlapping collections
[Diagram: a query such as "bank mergers" goes through collection selection over overlapping collections (e.g. FT, CNN, WSJ, WP, NYT), then query execution and results merging, as in a news meta-searcher or bibliography search engine.]
Objectives:
  Retrieve a variety of results
  Avoid collections with irrelevant or redundant results
Existing work (e.g. CORI) assumes collections are disjoint!
The COSCO Approach
“COllection Selection with Coverage and Overlap Statistics”
Collection Selection System
[Diagram:
  Online component: user query → map the query to frequent item sets → compute statistics for the query using the mapped item sets → determine the collection order for the query → collection order.
  Offline component: gather coverage and overlap information for past queries → identify frequent item sets among the queries → compute statistics for the frequent item sets → coverage/overlap statistics.]
Queries are keyword sets; query classes are frequent keyword subsets.
Challenge: Defining & Computing Overlap
Collection overlap may be
non-symmetric, or
“directional”. (A)
Document overlap may be
non-transitive. (B)
[Figure: (A) two collections C1 (results A-G) and C2 (results V-Z) illustrating directional overlap; (B) three collections C1, C2, C3 (results I-M) illustrating non-transitive document overlap.]
Gathering Overlap Statistics
Solution:
  Consider the query result set of a particular collection as a single bag of words
  Approximate overlap as the intersection between the result set bags
  Approximate overlap between 3+ collections using only pairwise overlaps
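A minimal sketch of this approximation (hypothetical documents; not COSCO's actual code): each collection's result set is flattened into a bag of words, and the pairwise overlap is taken as the size of the bag intersection.

```python
from collections import Counter

def result_bag(documents):
    # Flatten a collection's result set for a query into a single bag of words.
    bag = Counter()
    for doc in documents:
        bag.update(doc.lower().split())
    return bag

def pairwise_overlap(bag1, bag2):
    # Approximate overlap as the size of the multiset intersection of the bags.
    return sum((bag1 & bag2).values())

# Hypothetical top results returned by two collections for the query "bank mergers".
c1 = result_bag(["regulators approve bank mergers", "bank mergers slow down"])
c2 = result_bag(["bank mergers slow down", "tech ipo market heats up"])

print(pairwise_overlap(c1, c2))
# Overlap among 3+ collections is then approximated from these pairwise estimates.
```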
Controlling Statistics
Objectives:
Limit the number of statistics stored
Improve the chances of having statistics for new
queries
Solution:
Identify frequent item sets among queries (Apriori
algorithm)
Store statistics only with respect to these frequent item
sets
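A small sketch of the offline step with a hypothetical query log: count keyword subsets across past queries and keep those whose support clears the threshold. A full Apriori implementation would grow candidate item sets level by level instead of enumerating all subsets.

```python
from itertools import combinations
from collections import Counter

past_queries = [          # hypothetical keyword-set queries from the log
    {"neural", "network"}, {"neural", "network", "training"},
    {"bank", "mergers"}, {"neural", "network"}, {"bank", "mergers"},
]
MIN_SUPPORT = 2  # absolute support threshold (the talk uses 0.05% of the log)

# Count every keyword subset of every query (fine for short queries; Apriori
# would instead extend only the item sets already found to be frequent).
support = Counter()
for q in past_queries:
    for size in range(1, len(q) + 1):
        for itemset in combinations(sorted(q), size):
            support[itemset] += 1

frequent_itemsets = {s: c for s, c in support.items() if c >= MIN_SUPPORT}
print(frequent_itemsets)
```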
The Online Component
Purpose: determine the collection order for a user query
1. Map the query to the stored item sets
2. Compute statistics for the query using the mapped item sets
3. Determine the collection order
[The collection selection system diagram is shown again, with the online component highlighted.]
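A minimal sketch of the online component (hypothetical statistics), under the simplifying assumption that a query's statistics are the average of the statistics of its mapped item sets and that collections are ordered by estimated coverage; the actual system also discounts overlap with already-selected collections.

```python
# Coverage statistics stored per frequent item set (hypothetical numbers).
itemset_stats = {
    ("neural", "network"): {"ACMDL": 0.50, "CSB": 0.40, "COMPENDEX": 0.20},
    ("network",):          {"ACMDL": 0.30, "CSB": 0.45, "COMPENDEX": 0.35},
}

def collection_order(query_keywords):
    # 1. Map the query to the stored item sets it contains.
    mapped = [s for s in itemset_stats if set(s) <= set(query_keywords)]
    if not mapped:
        return []
    # 2. Estimate the query's statistics by averaging over the mapped item sets.
    estimate = {}
    for s in mapped:
        for coll, cov in itemset_stats[s].items():
            estimate[coll] = estimate.get(coll, 0.0) + cov / len(mapped)
    # 3. Order collections by estimated coverage.
    return sorted(estimate, key=estimate.get, reverse=True)

print(collection_order({"neural", "network", "training"}))
```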
Creating the Collection Test Bed
6 real collections were
probed:
ACM Digital Library,
Compendex, CSB, etc.
Documents: authors +
title + year +
conference + abstract
top-20 documents from
each collection
9 artificial collections
were created:
6 were proper subsets of
each of the 6 real
collections
2 were unions of two
subset collections from
above
1 was the union of 15% of
each real collection
15 overlapping, searchable collections
Training our System
Training set: 90% of the query list
Gathering statistics for training queries:
Probing of the 15 collections
Identifying frequent item sets:
Support threshold used: 0.05% (i.e. 9 queries)
681 frequent item sets found
Computing statistics for item sets:
Statistics fit in a 1.28MB file
Sample entry (for the item set "network,neural"):
  22 MIX15 0.11855 CI,SC 747 AG 0.07742 AD 0.01893 SC,MIX15 801.13636 …
Performance Evaluation
Measuring number of new and duplicate results:
Duplicate result: has cosine similarity > 0.95 with at
least one retrieved result
New result: has no duplicate
Oracular approach:
Knows which collection has most new results
Retrieves large portion of new results early
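A small sketch of the duplicate test used in the evaluation, assuming a plain bag-of-words cosine similarity and hypothetical documents: a retrieved result counts as a duplicate if its similarity with some earlier result exceeds 0.95.

```python
import math
from collections import Counter

def cosine(d1, d2):
    # Bag-of-words cosine similarity between two documents.
    b1, b2 = Counter(d1.lower().split()), Counter(d2.lower().split())
    dot = sum(b1[w] * b2[w] for w in b1)
    norm = math.sqrt(sum(v * v for v in b1.values())) * math.sqrt(sum(v * v for v in b2.values()))
    return dot / norm if norm else 0.0

def count_new_and_duplicates(results, retrieved, threshold=0.95):
    # A result is a duplicate if it is more than 0.95-similar to some already
    # retrieved result; otherwise it is new (and is added to the retrieved pool).
    new, dup = 0, 0
    for r in results:
        if any(cosine(r, seen) > threshold for seen in retrieved):
            dup += 1
        else:
            new += 1
            retrieved.append(r)
    return new, dup

retrieved = ["mining coverage statistics for web sources"]
print(count_new_and_duplicates(
    ["mining coverage statistics for web sources",   # duplicate
     "answering imprecise queries over databases"],  # new
    retrieved))
```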
Comparison with other approaches
[Chart: cumulative number of new results vs. collection rank (1-15) for COVERAGE-ONLY, COSCO, CORI, and ORACLE.]
Comparison of COSCO against CORI
[Charts: for CORI and for COSCO, the number of results, duplicates, and new results at each collection rank (1-15), with the cumulative number of new results overlaid.]
CORI: constant rate of change, as many new results as duplicates, more total results retrieved early.
COSCO: globally descending trend of new results, sharp difference between the number of new results and duplicates, fewer total results first.
Summary of Experimental Results
COSCO…
displays Oracular-like behavior.
consistently outperforms CORI.
retrieves up to 30% more results than CORI when test
queries reflect training queries.
can map at least 50% of queries to some item sets,
even in worst-case training queries.
is a step towards Oracular-like performance, but there is
still some room for improvement.
Part III: Answering Imprecise Queries
[WebDB, 2004; WWW, 2004]
Why Imprecise Queries ?
A Feasible Query:
  Want a 'sedan' priced around $7000
  Make = "Toyota", Model = "Camry", Price ≤ $7000
  Results (Make, Model, Price, Year):
    Toyota, Camry, $7000, 1999
    Toyota, Camry, $7000, 2001
    Toyota, Camry, $6700, 2000
    Toyota, Camry, $6500, 1998
    ………
But: What about the price of a Honda Accord? Is there a Camry for $7100?
Solution: Support Imprecise Queries
Dichotomy in Query Processing
Databases:
  User knows what she wants
  User query completely expresses the need
  Answers exactly matching query constraints
IR Systems:
  User has an idea of what she wants
  User query captures the need to some degree
  Answers ranked by degree of relevance
Existing Approaches
Similarity search over Vector space
• Data must be stored as vectors of text
WHIRL, W. Cohen, 1998
Enhanced database model
• Add ‘similar-to’ operator to SQL. Distances provided
by an expert/system designer
VAGUE, A. Motro, 1988
• Support similarity search and query refinement over
abstract data types
Binderberger et al, 2003
User guidance
• Users provide information about objects required
and their possible neighborhood
Proximity Search, Goldman et al, 1998
Limitations:
1. User/expert must provide
similarity measures
2. New operators to use distance
measures
3. Not applicable over autonomous
databases
Our Objectives:
1. Minimal user input
2. Database internals not affected
3. Domain-independent &
applicable to Web databases
AFDs based Query Relaxation
[Flow of the approach:
 1. Imprecise query Q → Map: convert "like" to "=" → Qpr = Map(Q)
 2. Derive the base set Abs = Qpr(R)
 3. Use the base set as a set of relaxable selection queries
 4. Using AFDs, find the relaxation order
 5. Derive the extended set by executing the relaxed queries
 6. Use concept similarity to measure tuple similarities
 7. Prune tuples below threshold
 8. Return the ranked set]
An Example
Relation: CarDB(Make, Model, Price, Year)
Imprecise query:  Q :− CarDB(Model like "Camry", Price like "10k")
Base query:  Qpr :− CarDB(Model = "Camry", Price = "10k")
Base set Abs:
  Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
  Make = "Toyota", Model = "Camry", Price = "10k", Year = "2001"
Obtaining Extended Set
Problem: Given base set, find tuples from database similar to
tuples in base set.
Solution:
• Consider each tuple in the base set as a selection query.
    e.g. Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
• Relax each such query to obtain "similar" precise queries.
    e.g. Make = "Toyota", Model = "Camry", Price = "", Year = "2000"
• Execute and determine tuples having similarity above some threshold.
Challenge: Which attribute should be relaxed first ?
Make ? Model ? Price ? Year ?
Solution: Relax least important attribute first.
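A minimal sketch of this relaxation step, assuming the relaxation order is already known (how it is derived follows next) and showing only 1-attribute relaxations:

```python
base_tuple = {"Make": "Toyota", "Model": "Camry", "Price": "10k", "Year": "2000"}
relaxation_order = ["Price", "Model", "Year", "Make"]  # least important first

def relaxed_queries(tuple_as_query, order):
    # Yield 1-attribute relaxations: drop one binding at a time, starting with
    # the least important attribute (multi-attribute relaxations would follow).
    for attr in order:
        yield {a: v for a, v in tuple_as_query.items() if a != attr}

for q in relaxed_queries(base_tuple, relaxation_order):
    print(q)
# The first relaxed query drops Price, then Model, and so on; each relaxed query
# is executed and the retrieved tuples are kept if similar enough to the base set.
```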
• Least Important Attribute
  Definition: An attribute whose binding value, when changed, has minimal effect on
  the values binding the other attributes.
  – Does not decide the values of other attributes
  – Its value may depend on other attributes
  E.g. Changing/relaxing Price will usually not affect other attributes,
  but changing Model usually affects Price.
• Dependence between attributes is useful to decide relative importance
  – Approximate Functional Dependencies & Approximate Keys
  – Approximate in the sense that they are obeyed by a large percentage (but not all)
    of the tuples in the database
  – Can use TANE, an algorithm by Huhtala et al [1999]
Attribute Ordering
Given a relation R:
• Determine the AFDs and approximate keys
• Pick the key with highest support, say Kbest
• Partition the attributes of R into
  – key attributes, i.e. belonging to Kbest
  – non-key attributes, i.e. not belonging to Kbest
• Sort the subsets using influence weights:
    InfluenceWeight(Ai) = Σj (1 − error(A' → Aj)) / |A'|
  where Ai ∈ A' ⊆ R, j ≠ i and j = 1 to |Attributes(R)|
• Attribute relaxation order is all non-keys first, then keys
• Multi-attribute relaxation assumes independence

Example: CarDB(Make, Model, Year, Price)
  Key attributes: Make, Year; Non-key: Model, Price
  Order: Price, Model, Year, Make
  1-attribute relaxations: {Price, Model, Year, Make}
  2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), …}
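A small sketch of deriving the relaxation order, with hypothetical AFD errors and an assumed best key; it uses a simplified reading of the influence-weight formula (each attribute scored by the average strength of the AFDs from it to the other attributes), then orders non-key attributes before key attributes.

```python
# Hypothetical outputs of AFD/approximate-key mining (e.g., with TANE) for
# CarDB(Make, Model, Year, Price): error(X -> Y) is the fraction of tuples
# violating the dependency, and kbest is the approximate key with best support.
attributes = ["Make", "Model", "Year", "Price"]
kbest = {"Make", "Year"}
afd_error = {
    ("Make", "Model"): 0.60, ("Make", "Price"): 0.65, ("Make", "Year"): 0.80,
    ("Model", "Make"): 0.01, ("Model", "Price"): 0.45, ("Model", "Year"): 0.85,
    ("Year", "Make"): 0.90, ("Year", "Model"): 0.90, ("Year", "Price"): 0.60,
    ("Price", "Make"): 0.95, ("Price", "Model"): 0.93, ("Price", "Year"): 0.90,
}

def influence_weight(ai):
    # Simplified influence: average (1 - error) of the dependencies from ai to
    # every other attribute; high weight means changing ai disturbs many values.
    others = [a for a in attributes if a != ai]
    return sum(1 - afd_error[(ai, aj)] for aj in others) / len(others)

# Relaxation order: all non-key attributes first, then key attributes,
# each group ordered from least to most influential.
non_keys = sorted((a for a in attributes if a not in kbest), key=influence_weight)
keys = sorted((a for a in attributes if a in kbest), key=influence_weight)
print(non_keys + keys)   # ['Price', 'Model', 'Year', 'Make'] with these numbers
```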
Tuple Similarity
Tuples obtained after relaxation are ranked according to their
similarity to the corresponding tuples in base set
Similarity(t1, t2) = Σi AttrSimilarity(value(t1[Ai]), value(t2[Ai])) × Wi
where Wi = normalized influence weights, Σ Wi = 1, i = 1 to |Attributes(R)|
Value Similarity
• Euclidean for numerical attributes e.g. Price, Year
• Concept Similarity for categorical e.g. Make, Model
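A minimal sketch of this scoring, with hypothetical weights and similarities: numeric attributes use a simple normalized-difference similarity standing in for the Euclidean measure, and categorical attributes use a looked-up concept similarity.

```python
# Normalized influence weights (hypothetical; they must sum to 1).
weights = {"Make": 0.35, "Model": 0.30, "Price": 0.20, "Year": 0.15}

# Hypothetical concept similarities for categorical values (from supertuples).
concept_similarity = {("Camry", "Accord"): 0.45, ("Toyota", "Honda"): 0.25}

def attr_similarity(attr, v1, v2):
    if attr in ("Price", "Year"):
        v1, v2 = float(v1), float(v2)
        # Simple numeric similarity in [0, 1]; the talk uses Euclidean distance.
        return 1.0 - abs(v1 - v2) / max(v1, v2)
    if v1 == v2:
        return 1.0
    return concept_similarity.get((v1, v2), concept_similarity.get((v2, v1), 0.0))

def tuple_similarity(t1, t2):
    # Similarity(t1, t2) = sum_i AttrSimilarity(t1[Ai], t2[Ai]) * Wi
    return sum(attr_similarity(a, t1[a], t2[a]) * w for a, w in weights.items())

base = {"Make": "Toyota", "Model": "Camry", "Price": 7000, "Year": 2000}
cand = {"Make": "Honda", "Model": "Accord", "Price": 6900, "Year": 1999}
print(round(tuple_similarity(base, cand), 3))
```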
Concept (Value) Similarity
Concept: any distinct attribute-value pair, e.g. Make=Toyota
• Visualized as a selection query binding a single attribute
• Represented as a supertuple
Concept Similarity: estimated as the percentage of correlated values
common to two given concepts
Supertuple for the concept Make=Toyota, ST(Q_Make=Toyota):
  Model: Camry: 3, Corolla: 4, …
  Year:  2000: 6, 1999: 5, 2001: 2, …
  Price: 5995: 4, 6500: 3, 4000: 6

Similarity(v1, v2) = Commonality(Correlated(v1, values(Ai)), Correlated(v2, values(Ai)))
  where v1, v2 ∈ Aj, i ≠ j and Ai, Aj ∈ R
• Measured as the Jaccard similarity among the supertuples representing the concepts:
  JaccardSim(A, B) = |A ∩ B| / |A ∪ B|
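A small sketch of the supertuple comparison with hypothetical counts: each concept's supertuple is treated as a bag of correlated values, and the concept similarity is the Jaccard similarity of the two bags.

```python
from collections import Counter

# Supertuples: for each concept, the correlated values (with counts) that
# co-occur with it in the database (hypothetical numbers).
supertuple_toyota = Counter({"Camry": 3, "Corolla": 4, "2000": 6, "1999": 5, "5995": 4})
supertuple_honda  = Counter({"Accord": 4, "Civic": 3, "2000": 5, "1999": 2, "5995": 2})

def jaccard_sim(a, b):
    # JaccardSim(A, B) = |A intersect B| / |A union B|, over bags with counts.
    intersection = sum((a & b).values())
    union = sum((a | b).values())
    return intersection / union if union else 0.0

print(round(jaccard_sim(supertuple_toyota, supertuple_honda), 3))
```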
Concept (Value) Similarity Graph
[Graph: concept similarity among Make values (Toyota, Honda, Ford, Chevrolet, Dodge, Nissan, BMW), with edge weights such as 0.25, 0.22, 0.16, 0.15, 0.12, 0.11.]
Empirical Evaluation
Goal
• Evaluate the effectiveness of the query relaxation and concept learning
Setup
• A database of used cars: CarDB(Make, Model, Year, Price, Mileage, Location, Color)
• Populated using 30k tuples from Yahoo Autos
• Concept similarity estimated for Make, Model, Location, Color
• Two query relaxation algorithms:
  – RandomRelax: randomly picks the attribute to relax
  – GuidedRelax: uses the relaxation order determined using approximate keys and AFDs
Evaluating the effectiveness of relaxation
Test Scenario
• 10 randomly selected base queries from CarDB
• 20 tuples showing similarity > Є, with 0.5 < Є < 1
• Weighted summation of attribute similarities
  – Euclidean distance used for Year, Price, Mileage
  – Concept similarity used for Make, Model, Location, Color
• Limit of 64 relaxed queries per base query (128 max possible with 7 attributes)
• Efficiency measured using the metric
    Work/RelevantTuple = |ExtractedTuples| / |RelevantExtracted|
Efficiency of Relaxation
[Charts: Work/Relevant Tuple for each of the 10 test queries under Random Relaxation and Guided Relaxation, for Є = 0.5, 0.6, and 0.7.]
• Random relaxation: on average 8 tuples extracted per relevant tuple for Є = 0.5, increasing to 120 tuples for Є = 0.7; not resilient to changes in Є.
• Guided relaxation: on average 4 tuples extracted per relevant tuple for Є = 0.5, going up to 12 tuples for Є = 0.7; resilient to changes in Є.
Summary
An approach for answering imprecise queries over Web databases
• Mine and use AFDs to determine attribute importance
• Domain-independent concept similarity estimation technique
• Tuple similarity score as a weighted sum of attribute similarity scores
Empirical evaluation shows
• Reasonable concept similarity models estimated
• Sets of similar precise queries efficiently identified
Adaptive Information Integration
• Query processing in information integration needs to
be adaptive to:
– Source characteristics
• How is the data spread among the sources?
– User needs
• Multi-objective queries (tradeoff coverage for cost)
• Imprecise queries
• To be adaptive, we need profiles (meta-data) about
sources as well as users
– Challenge: Profiles are not going to be provided..
• Autonomous sources may not export meta-data about data spread!
• Lay users may not be able to articulate the source of their imprecision!
Need approaches that gather (learn) the meta-data they need
Three contributions to
Adaptive Information Integration
• BibFinder
– Learns and uses source coverage and overlap statistics to
support multi-objective query processing
• [VLDB 2003; ICDE 2004; TKDE 2005]
• COSCO
– Adapts the Coverage/Overlap techniques to text collection
selection
• Imprecise query answering
  – Supports imprecise queries by automatically learning
    approximate structural relations among data tuples
• [WebDB 2004; WWW 2004]
Although we focus on avoiding retrieval of duplicates,
Coverage/Overlap statistics can also be used to look for duplicates
Current Directions
• Focusing on retrieving redundant records/documents to
improve information quality
– E.g. multiple viewpoints on the same story, additional details
(e.g. a BibTeX entry) on a bibliography record
– Our coverage/overlap statistics can be used for this purpose too!
• Learning and exploiting other types of source statistics
– “Density”—the percentage of null values in a record
– “Recency”/“Freshness”: how recent the results from a source
are likely to be
• These statistics also may vary based on the query type
– E.g. DBLP is more up-to-date for database papers than AI papers
– Such statistics can be used to increase the quality of answers
returned by the mediator in accessing top-K sources.