June, 2014
Probabilistic Databases - Dan Suciu
A COURSE ON
PROBABILISTIC
DATABASES
Dan Suciu
University of Washington
1
June, 2014
Probabilistic Databases - Dan Suciu
2
Part 1
Motivating Applications
2.
The Probabilistic Data Model
Chapter 2
3.
Extensional Query Plans
Chapter 4.2
4.
The Complexity of Query Evaluation
Chapter 3
5.
Extensional Evaluation
Chapter 4.1
6.
Intensional Evaluation
Chapter 5
7.
Conclusions
Part 4
Part
3
1.
Part 2
Outline
June, 2014
Probabilistic Databases - Dan Suciu
Probabilistic Data Model: Outline
• Review: Relational Data Model
• Incomplete Databases
• Probabilistic Databases (PDB)
• Query semantics
• Block Independent Disjoint
3
June, 2014
Probabilistic Databases - Dan Suciu
4
Review: Relational Data Model
Data:
stored in relations (= tables)
Owner
Location
Name
Object
Object
Time
Loc
Joe
Book302
Laptop77
5:07
Hall
Joe
Laptop77
Laptop77
9:05
Office
Jim
Laptop77
Book302
8:18
Office
Fred
GgleGlass
June, 2014
Probabilistic Databases - Dan Suciu
5
Review: Relational Data Model
Data:
stored in relations (= tables)
Queries: SQL,
Find all owners of objects in the Office
-- SQL: e.g. postgres
SELECT DISTINCT Owner.name
FROM Owner, Location
WHERE Owner.object = Location.object
and Location.loc = ‘Office’
Owner
Location
Name
Object
Object
Time
Loc
Joe
Book302
Laptop77
5:07
Hall
Joe
Laptop77
Laptop77
9:05
Office
Jim
Laptop77
Book302
8:18
Office
Fred
GgleGlass
June, 2014
Probabilistic Databases - Dan Suciu
6
Review: Relational Data Model
Data:
stored in relations (= tables)
Queries: SQL,
Owner
Location
Name
Object
Object
Time
Loc
Joe
Book302
Laptop77
5:07
Hall
Joe
Laptop77
Laptop77
9:05
Office
Jim
Laptop77
Book302
8:18
Office
Fred
GgleGlass
Unions of Conjunctive Queries
Find all owners of objects in the Office
-- SQL: e.g. postgres
SELECT DISTINCT Owner.name
FROM Owner, Location
WHERE Owner.object = Location.object
and Location.loc = ‘Office’
Q(z) = Owner(z,x), Location(x,t,y),y=‘Office’
Note that x,t, are existentially quantified:
Q(z) = ∃x ∃t (Owner(z,x), Location(x,t,’Office’))
June, 2014
Probabilistic Databases - Dan Suciu
7
Review: Relational Data Model
Data:
stored in relations (= tables)
Queries: SQL,
Owner
Location
Name
Object
Object
Time
Loc
Joe
Book302
Laptop77
5:07
Hall
Joe
Laptop77
Laptop77
9:05
Office
Jim
Laptop77
Book302
8:18
Office
Fred
GgleGlass
Unions of Conjunctive Queries
Find all owners of objects in the Office
-- SQL: e.g. postgres
Q(z) = Owner(z,x), Location(x,t,y),y=‘Office’
SELECT DISTINCT Owner.name
FROM Owner, Location
WHERE Owner.object = Location.object
and Location.loc = ‘Office’
Name
Query answer: Q=
Joe
Jim
Note that x,t, are existentially quantified:
Q(z) = ∃x ∃t (Owner(z,x), Location(x,t,’Office’))
June, 2014
Probabilistic Databases - Dan Suciu
8
Review: Relational Data Model
Data:
stored in relations (= tables)
Queries: SQL,
Owner
Location
Name
Object
Object
Time
Loc
Joe
Book302
Laptop77
5:07
Hall
Joe
Laptop77
Laptop77
9:05
Office
Jim
Laptop77
Book302
8:18
Office
Fred
GgleGlass
Unions of Conjunctive Queries
Find all owners of objects in the Office
-- SQL: e.g. postgres
Q(z) = Owner(z,x), Location(x,t,y),y=‘Office’
SELECT DISTINCT Owner.name
FROM Owner, Location
WHERE Owner.object = Location.object
and Location.loc = ‘Office’
Name
Query answer: Q=
Joe
Jim
Note that x,t, are existentially quantified:
Q(z) = ∃x ∃t (Owner(z,x), Location(x,t,’Office’))
Review: Complexity of Query Evaluation
Query Q, database D
• Data complexity:
This course
fix Q, complexity = f(D)
• Query complexity:
fix D, complexity = f(Q)
Moshe Vardi
• Combined complexity: complexity = f(D,Q)
Data complexity is unique to database research
June, 2014
Probabilistic Databases - Dan Suciu
10
Incomplete Database
Definition An Incomplete Database is a finite set
of database instances W = (W1, W2, …, Wn)
Each Wi is called
a possible world
June, 2014
Probabilistic Databases - Dan Suciu
11
Incomplete Database
Definition An Incomplete Database is a finite set
of database instances W = (W1, W2, …, Wn)
W1
W2
Owner
Name
Object
Joe
Book302
Joe
Laptop77
Jim
Laptop77
Fred
GgleGlass
Location
Object
Time
Loc
Laptop77
5:07
Hall
Laptop77
9:05
Office
Book302
8:18
Office
W3
Each Wi is called
a possible world
W4
June, 2014
Probabilistic Databases - Dan Suciu
12
Incomplete Database
Definition An Incomplete Database is a finite set
of database instances W = (W1, W2, …, Wn)
W1
W2
Owner
W3
Owner
Name
Object
Name
Object
Joe
Book302
Joe
Book302
Joe
Laptop77
Jim
Laptop77
Jim
Laptop77
Fred
GgleGlass
Fred
GgleGlass
Location
Location
Object
Time
Loc
Object
Time
Loc
Laptop77
5:07
Hall
Book302
8:18
Office
Laptop77
9:05
Office
Book302
8:18
Office
Each Wi is called
a possible world
W4
June, 2014
Probabilistic Databases - Dan Suciu
13
Incomplete Database
Each Wi is called
a possible world
Definition An Incomplete Database is a finite set
of database instances W = (W1, W2, …, Wn)
W1
W2
Owner
W3
Owner
W4
Owner
Owner
Name
Object
Name
Object
Name
Object
Name
Object
Joe
Book302
Joe
Book302
Jim
Laptop77
Joe
Book302
Joe
Laptop77
Jim
Laptop77
Jim
Laptop77
Jim
Laptop77
Fred
GgleGlass
Fred
GgleGlass
Fred
GgleGlass
Location
Location
Location
Location
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Laptop77
5:07
Hall
Book302
8:18
Office
Laptop77
5:07
Hall
Laptop77
5:07
Hall
Laptop77
9:05
Office
Laptop77
9:05
Office
Laptop77
9:05
Office
Book302
8:18
Office
Book302
8:18
Office
June, 2014
Probabilistic Databases - Dan Suciu
14
Incomplete Database: Query Semantics
Definition Given query Q, incomplete database W:
• An answer t is certain, if ∀Wi, t ∈Q(Wi)
• An answer t is possible if ∃Wi, t ∈Q(Wi)
June, 2014
Probabilistic Databases - Dan Suciu
15
Incomplete Database: Query Semantics
Definition Given query Q, incomplete database W:
• An answer t is certain, if ∀Wi, t ∈Q(Wi)
• An answer t is possible if ∃Wi, t ∈Q(Wi)
W1
Q(z) = Owner(z,x), Location(x,t,’Office’)
W2
Owner
W3
Owner
W4
Owner
Owner
Name
Object
Name
Object
Name
Object
Name
Object
Joe
Book302
Joe
Book302
Joe
Laptop77
Joe
Book302
Joe
Laptop77
Jim
Laptop77
Jim
Laptop77
Jim
Laptop77
Fred
GgleGlass
Fred
GgleGlass
Fred
GgleGlass
Location
Location
Location
Location
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Laptop77
5:07
Hall
Book302
8:18
Office
Laptop77
5:07
Hall
Laptop77
5:07
Hall
Laptop77
9:05
Office
Laptop77
9:05
Office
Laptop77
9:05
Office
Book302
8:18
Office
Book302
8:18
Office
June, 2014
Probabilistic Databases - Dan Suciu
16
Incomplete Database: Query Semantics
Definition Given query Q, incomplete database W:
• An answer t is certain, if ∀Wi, t ∈Q(Wi)
• An answer t is possible if ∃Wi, t ∈Q(Wi)
W1
Q(z) = Owner(z,x), Location(x,t,’Office’)
W2
Owner
W3
Owner
W4
Owner
Owner
Name
Object
Name
Object
Name
Object
Name
Object
Joe
Book302
Joe
Book302
Joe
Laptop77
Joe
Book302
Joe
Laptop77
Jim
Laptop77
Jim
Laptop77
Jim
Laptop77
Fred
GgleGlass
Fred
GgleGlass
Fred
GgleGlass
Location
Location
Location
Location
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Laptop77
5:07
Hall
Book302
8:18
Office
Laptop77
5:07
Hall
Laptop77
5:07
Hall
Laptop77
9:05
Office
Laptop77
9:05
Office
Laptop77
9:05
Office
Book302
8:18
Office
Book302
8:18
Office
Q=
Joe
Jim
Q=
Joe
Q=
Joe
Q=
Joe
Jim
June, 2014
Probabilistic Databases - Dan Suciu
17
Incomplete Database: Query Semantics
Definition Given query Q, incomplete database W:
• An answer t is certain, if ∀Wi, t ∈Q(Wi)
• An answer t is possible if ∃Wi, t ∈Q(Wi)
W1
Q(z) = Owner(z,x), Location(x,t,’Office’)
Certain answers to Q: Joe
Possible answers to Q: Joe, Jim
W2
Owner
W3
Owner
W4
Owner
Owner
Name
Object
Name
Object
Name
Object
Name
Object
Joe
Book302
Joe
Book302
Joe
Laptop77
Joe
Book302
Joe
Laptop77
Jim
Laptop77
Jim
Laptop77
Jim
Laptop77
Fred
GgleGlass
Fred
GgleGlass
Fred
GgleGlass
Location
Location
Location
Location
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Laptop77
5:07
Hall
Book302
8:18
Office
Laptop77
5:07
Hall
Laptop77
5:07
Hall
Laptop77
9:05
Office
Laptop77
9:05
Office
Laptop77
9:05
Office
Book302
8:18
Office
Book302
8:18
Office
Q=
Joe
Jim
Q=
Joe
Q=
Joe
Q=
Joe
Jim
June, 2014
Probabilistic Databases - Dan Suciu
18
Probabilistic Database
Definition A Probabilistic Database is (W, P), where W is an incomplete
database, and P: W [0,1] a probability distribution: Σi=1,n P(Wi) = 1
June, 2014
Probabilistic Databases - Dan Suciu
19
Probabilistic Database
Definition A Probabilistic Database is (W, P), where W is an incomplete
database, and P: W [0,1] a probability distribution: Σi=1,n P(Wi) = 1
W1
W2
0.3
Owner
W3
0.4
Owner
W4
0.2
Owner
0.1
Owner
Name
Object
Name
Object
Name
Object
Name
Object
Joe
Book302
Joe
Book302
Jim
Laptop77
Joe
Book302
Joe
Laptop77
Jim
Laptop77
Jim
Laptop77
Jim
Laptop77
Fred
GgleGlass
Fred
GgleGlass
Fred
GgleGlass
Location
Location
Location
Location
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Laptop77
5:07
Hall
Book302
8:18
Office
Laptop77
5:07
Hall
Laptop77
5:07
Hall
Laptop77
9:05
Office
Laptop77
9:05
Office
Laptop77
9:05
Office
Book302
8:18
Office
Book302
8:18
Office
June, 2014
Probabilistic Databases - Dan Suciu
20
Probabilistic Database: Query Semantics
Definition Given query Q, probabilistic database (W,P):
• The marginal probability of an answer t is:
P(t) = Σ { P(Wi) | Wi ∈ W, t ∈Q(Wi) }
June, 2014
Probabilistic Databases - Dan Suciu
21
Probabilistic Database: Query Semantics
Definition Given query Q, probabilistic database (W,P):
Q(z) = Owner(z,x), Location(x,t,’Office’)
• The marginal probability of an answer t is:
P(t) = Σ { P(Wi) | Wi ∈ W, t ∈Q(Wi) }
W1
W2
0.3
Owner
W3
0.4
Owner
W4
0.2
Owner
0.1
Owner
Name
Object
Name
Object
Name
Object
Name
Object
Joe
Book302
Joe
Book302
Jim
Laptop77
Joe
Book302
Joe
Laptop77
Jim
Laptop77
Jim
Laptop77
Jim
Laptop77
Fred
GgleGlass
Fred
GgleGlass
Fred
GgleGlass
Location
Location
Location
Location
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Laptop77
5:07
Hall
Book302
8:18
Office
Laptop77
5:07
Hall
Laptop77
5:07
Hall
Laptop77
9:05
Office
Laptop77
9:05
Office
Laptop77
9:05
Office
Book302
8:18
Office
Book302
8:18
Office
June, 2014
Probabilistic Databases - Dan Suciu
22
Probabilistic Database: Query Semantics
Definition Given query Q, probabilistic database (W,P):
Q(z) = Owner(z,x), Location(x,t,’Office’)
• The marginal probability of an answer t is:
P(t) = Σ { P(Wi) | Wi ∈ W, t ∈Q(Wi) }
W1
W2
0.3
Owner
W3
0.4
Owner
W4
0.2
Owner
0.1
Owner
Name
Object
Name
Object
Name
Object
Name
Object
Joe
Book302
Joe
Book302
Joe
Laptop77
Joe
Book302
Joe
Laptop77
Jim
Laptop77
Jim
Laptop77
Jim
Laptop77
Fred
GgleGlass
Fred
GgleGlass
Fred
GgleGlass
Location
Location
Location
Location
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Laptop77
5:07
Hall
Book302
8:18
Office
Laptop77
5:07
Hall
Laptop77
5:07
Hall
Laptop77
9:05
Office
Laptop77
9:05
Office
Laptop77
9:05
Office
Book302
8:18
Office
Book302
8:18
Office
Q=
Joe
Jim
Q=
Joe
Q=
Joe
Q=
Joe
Jim
June, 2014
Probabilistic Databases - Dan Suciu
23
Probabilistic Database: Query Semantics
Definition Given query Q, probabilistic database (W,P):
Q(z) = Owner(z,x), Location(x,t,’Office’)
• The marginal probability of an answer t is:
P(t) = Σ { P(Wi) | Wi ∈ W, t ∈Q(Wi) }
P(Joe) = 1.0
P(Jim) = 0.4
W1
W2
0.3
Owner
W3
0.4
Owner
W4
0.2
Owner
0.1
Owner
Name
Object
Name
Object
Name
Object
Name
Object
Joe
Book302
Joe
Book302
Joe
Laptop77
Joe
Book302
Joe
Laptop77
Jim
Laptop77
Jim
Laptop77
Jim
Laptop77
Fred
GgleGlass
Fred
GgleGlass
Fred
GgleGlass
Location
Location
Location
Location
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Object
Time
Loc
Laptop77
5:07
Hall
Book302
8:18
Office
Laptop77
5:07
Hall
Laptop77
5:07
Hall
Laptop77
9:05
Office
Laptop77
9:05
Office
Laptop77
9:05
Office
Book302
8:18
Office
Book302
8:18
Office
Q=
Joe
Jim
Q=
Joe
Q=
Joe
Q=
Joe
Jim
June, 2014
Probabilistic Databases - Dan Suciu
24
Discussion
• Intuition: a probabilistic database says that the database
can be in one of possible states, each with a probability
• Possible query answers: a set of answers annotated with
probabilities:
(t1, p1), (t2, p2), (t3, p3), …
Usually: p1 ≥ p2 ≥ p3 ≥ …
• Problem: the number of possible world in a probabilistic
database is astronomically large. To represent it, we
impose some restrictions
June, 2014
Probabilistic Databases - Dan Suciu
Independent, Disjoint Tuples
Definition Given a probabilistic database (W, P).
Two tuples t1, t2 are called:
• Independent, if:
P(t1 t2) = P(t1) P(t2)
• Disjoint (or exclusive), if:
P(t1t2) = 0
25
June, 2014
Probabilistic Databases - Dan Suciu
Independent, Disjoint Tuples
Definition Given a probabilistic database (W, P).
Two tuples t1, t2 are called:
• Independent, if:
P(t1 t2) = P(t1) P(t2)
• Disjoint (or exclusive), if:
P(t1t2) = 0
Definition A probabilistic database is called
Block-Independent-Disjoint (BID), if its tuples are grouped
into blocks such that:
• Tuples from the same block are disjoint
• Tuples from different blocks are independent
26
June, 2014
Example: BID Table
BID Table
Object
Time
Loc
P
Laptop77
9:07
Rm444
p1
Laptop77
9:07
Hall
p2
Book302
9:18
Office
p3
Book302
9:18
Rm444
p4
Book302
9:18
Lift
p5
Disjoint
Independent
Disjoint
Object
Time Loc
Object
Time Loc
Laptop77 Object
9:07
Rm444
Time Loc
Laptop77 Object
9:07
Rm444
Time Loc
Book302 Laptop77
9:18
Office
9:07
Object Rm444
Time Loc
Book302 Laptop77
9:18
Rm444
9:07
Hall
Object
Time Loc
Book302 Laptop77
9:18
Lift9:07
Hall
Book302 Laptop77
9:18Object
Office
9:07 Time
Hall Loc
Object
1 3
Book302
9:18
Rm444 Time Loc
Laptop77
9:07Lift Rm444
Time Loc
1 4
Book302
9:18Object
Laptop77 Object
9:07
Hall
Time Loc
Book302 Object
9:18
Office
Time Loc
Book302 Object
9:18
Rm444
Time Loc
Book302
9:18
Lift
1
3 4 5
W={
pp
27
Probabilistic Databases - Dan Suciu
}
pp
p (1- p -p -p )
Possible
worlds
June, 2014
Probabilistic Databases - Dan Suciu
28
The Query Evaluation Problem
Given: a BID database D, a query Q, and output tuple t
Compute: P(t)
Note: D has, say, 1000000 tuples,
while the number of possible worlds is 21000000
Challenge: compute P(t) efficiently, in the size of D
Data complexity: the complexity of P depends dramatically on Q
June, 2014
Probabilistic Databases - Dan Suciu
An Example
29
Boolean query
SELECT DISTINCT ‘true’
FROM R, S
WHERE R.x = S.x
Q() = R(x), S(x,y)
P(Q) = 1- {1- p1*[ 1-(1-q1)*(1-q2) ] } *
{1- p2*[ 1-(1-q3)*(1-q4)*(1-q5) ] }
One can compute P(Q) in PTIME
in the size of the database D
R
x
a1
P
p1
a2
a3
p2
p3
S
x
a1
y
b1
P
q1
a1
b2
q2
a2
a2
b3
b4
q3
q4
a2
b5
q5
June, 2014
Probabilistic Databases - Dan Suciu
30
Summary of the Probabilistic Data Model
• Possible Worlds Semantics: very powerful but difficult to
represent
• Block-Independent-Disjoint databases have efficient
representation: D is stored in a traditional database
• Independent Databases: even simpler
This tutorial
Main challenge: evaluate Q efficiently in the size of D
© Copyright 2026 Paperzz