Efficient Discovery of XML Data Redundancies
Cong Yu and H. V. Jagadish
University of Michigan, Ann Arbor
VLDB 2006, Seoul, Korea
September 12th, 2006
Talk Outline
• Motivating Example
• A Comprehensive Notion of XML FD
• XML Redundancy Discovery Algorithms
• Experimental Evaluation
• Conclusion
An Example XML Document
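[Figure: the example XML data tree. The root warehouse has state children, each state has store children, and each store has a name and book children. Each book has ISBN, title, and price, plus zero or more au children. Three books are shown, all with ISBN “… 269” and title “DB”: two in stores named “Borders”, priced “$59.9” with au values “R.R.” and “J.G.”, and one in the store named “Amazon”, priced “$51.1”.]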
Constraints on XML Data
• An example constraint:
For any two books, if they have the same ISBN, then they have the same title.
Target: book
Condition element(s): ISBN
Implication element(s): title
• Similar to Equality Generating Dependencies
(EGDs) [BV84] and Nested EGDs [YP04]
Data Redundancies
• E.g., title is redundantly stored
• Result of “non-optimal” design of the
database schema in the presence of
constraints
• Lead to:
Update anomalies
Increased cost for data transfer and
manipulation
• Constraints are the properties of data
May not be known at the design phase
Goal
Efficiently Discover
Redundancies From the XML
Database By Discovering
Satisfied Constraints
Main Contributions
• A comprehensive notion of XML FD
Capturing a semantically richer set of XML
constraints
Definition of XML data redundancy in terms
of XML FDs and XML Keys
• Efficient algorithms for discovering FDs
and data redundancies from an XML
database
• Experimental Evaluation
Example XML Constraints
• Hierarchical: condition and/or implication
elements can come from multiple hierarchies
[Figure: the example XML data tree (state, store, name, book, ISBN, title, price, au), illustrating a constraint whose elements come from more than one level of the hierarchy.]
Example XML Constraints, Cont’d
• Set elements: condition and/or implication
elements can involve set elements
[Figure: the example XML data tree (state, store, name, book, ISBN, title, price, au), illustrating a constraint that involves the repeated au elements of a book.]
Functional Dependencies (FDs)
• FDs are used to describe constraints in
relational databases
• A similar notion of FD is needed for XML
• Challenges:
Target is difficult to specify due to the
hierarchical structure
Set elements introduce new semantics
XML FDs need richer semantics!
Previous Notions
• Path Based Notion [LLL02,VLL04]
Example: {/warehouse/state/store/book/ISBN} → /warehouse/state/store/book/title
Format: LHS → RHS
Semantics: for any two RHS nodes, same (associated) LHS indicates same RHS
• Tree Tuple Based Notion [AL04]
A tree tuple is a data tree, with exactly one data node for each schema element
Format: LHS → RHS
Semantics: for any two tree tuples, same LHS indicates same RHS
Previous Notions, cont’d
• Both capture hierarchical constraints
• Neither can capture set constraints
• {/store/book/ISBN} → /store/book/au
Violated in previous notions
Satisfied if the two au nodes are treated as a single set
• {/store/book/title, /store/book/au} → /store/book/ISBN
Undefined in previous notions
Intuitive if the au nodes are treated as a single set
[Figure: a store named “Borders” with one book that has ISBN “… 269”, title “DB”, price “$59.9”, and two au children, “R.R.” and “J.G.”.]
A New Comprehensive Notion
• Generalized Tree Tuple
A data tree constructed around a pivot data node (n_p)
Entire subtree rooted at n_p is kept
All ancestors of n_p and their “attributes” are kept
• Tuple Class C_P
The set of all generalized tree tuples whose pivot nodes share the same path P (called the pivot path)
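The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `generalized_tree_tuple`, the `REPEATABLE` tag set, and the second book in the sample document are hypothetical, and an ancestor's “attributes” are taken to be its non-repeatable children.

```python
import copy
import xml.etree.ElementTree as ET

# Assumption: repeatability is a schema property, so it is supplied explicitly.
REPEATABLE = {"state", "store", "book", "au"}

def generalized_tree_tuple(root, pivot, repeatable=REPEATABLE):
    """Build a new tree holding the pivot's whole subtree plus every ancestor
    of the pivot together with that ancestor's non-repeatable children."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    result = copy.deepcopy(pivot)              # entire subtree rooted at the pivot
    node = pivot
    while node in parent_of:
        ancestor = parent_of[node]
        wrapper = ET.Element(ancestor.tag, dict(ancestor.attrib))
        for child in ancestor:
            if child is node:
                wrapper.append(result)         # the path leading down to the pivot
            elif child.tag not in repeatable:
                wrapper.append(copy.deepcopy(child))  # an ancestor "attribute"
        result, node = wrapper, ancestor
    return result

# Hypothetical sample document, loosely modeled on the running example.
doc = ET.fromstring(
    "<warehouse><state><store><name>Borders</name>"
    "<book><ISBN>…269</ISBN><title>DB</title><au>R.R.</au><au>J.G.</au></book>"
    "<book><ISBN>other</ISBN><title>XML</title></book>"
    "</store><store><name>Amazon</name></store></state></warehouse>"
)
pivot = doc.find("./state/store/book")         # the first book is the pivot
t = generalized_tree_tuple(doc, pivot)
```

With the first book as pivot, the resulting tuple keeps the book's ISBN, title, and both au children, the enclosing store with its name, and the state and warehouse ancestors. The sibling book and the sibling store are dropped because they are repeatable elements outside the pivot's subtree.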
Example Generalized Tree Tuple
[Figure: the example XML data tree with a pivot node marked.]
Example Generalized Tree Tuple
[Figure: the example XML data tree with a pivot node marked.]
XML FD
• <C_P, LHS, RHS>: LHS → RHS w.r.t. C_P
• Semantics:
for any two generalized tree tuples t1, t2 in C_P, if they share the same LHS, they have the same RHS.
• E.g., {./title, ./au} → ./ISBN, w.r.t. C_P for P = /warehouse/state/store/book
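The pairwise semantics can be checked directly. The sketch below is an illustration under simplifying assumptions: each generalized tree tuple of the book tuple class is flattened into a dictionary, the repeated au element is represented as a `frozenset` so that it compares as a single set, and the name `fd_holds` is hypothetical.

```python
from itertools import combinations

def fd_holds(tuples, lhs, rhs):
    """True if every pair of tuples that agrees on all LHS attributes
    also agrees on all RHS attributes."""
    for t1, t2 in combinations(tuples, 2):
        same_lhs = all(t1[a] == t2[a] for a in lhs)
        same_rhs = all(t1[a] == t2[a] for a in rhs)
        if same_lhs and not same_rhs:
            return False
    return True

# The three books of the running example; au is a set-valued attribute.
books = [
    {"ISBN": "…269", "title": "DB", "price": "$59.9",
     "au": frozenset({"R.R.", "J.G."})},
    {"ISBN": "…269", "title": "DB", "price": "$51.1",
     "au": frozenset()},
    {"ISBN": "…269", "title": "DB", "price": "$59.9",
     "au": frozenset({"J.G.", "R.R."})},
]

title_au_to_isbn = fd_holds(books, ["title", "au"], ["ISBN"])
isbn_to_title = fd_holds(books, ["ISBN"], ["title"])
isbn_to_price = fd_holds(books, ["ISBN"], ["price"])
```

On this data, {./title, ./au} → ./ISBN and {./ISBN} → ./title hold, while {./ISBN} → ./price does not, because the same ISBN appears with two different prices.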
Repeatable Elements Are Special
[Figure: the example XML data tree, in which state, store, book, and au occur repeatedly under their parents.]
Essential Tuple Classes
• Definition:
Tuple classes with pivot paths that correspond to repeatable schema elements
C_P for P = /warehouse/state/store/book is essential
C_P for P = /warehouse/state/store/name is not
• Essential tuple classes can also express the XML FDs that are expressible with non-essential tuple classes
• See paper for detailed proof
XML Key and Data Redundancy
• Let attribute @key uniquely identify each node in the entire data tree
• <C_P, LHS> is an XML Key when the database satisfies the XML FD: LHS → ./@key w.r.t. C_P
• Similar to the relative key notion proposed in [BDF+01]
• Data redundancy exists if the database:
Satisfies the XML FD <C_P, LHS, RHS>,
But <C_P, LHS> is not an XML key
RHS is then redundantly stored.
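The two definitions compose into a short test. The following is a minimal sketch, assuming each tuple of the book tuple class is flattened into a dictionary that carries its @key; the helper names `fd_holds`, `is_key`, and `is_redundant` are hypothetical.

```python
def fd_holds(rows, lhs, rhs):
    """LHS -> RHS holds if each LHS value is seen with only one RHS value."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        if seen.setdefault(left, right) != right:
            return False
    return True

def is_key(rows, lhs):
    """<C_P, LHS> is an XML Key when LHS -> ./@key holds."""
    return fd_holds(rows, lhs, ["@key"])

def is_redundant(rows, lhs, rhs):
    """RHS is redundantly stored when the FD holds but LHS is not a key."""
    return fd_holds(rows, lhs, rhs) and not is_key(rows, lhs)

# The three book tuples of the running example.
books = [
    {"@key": 6, "ISBN": "…269", "title": "DB", "price": "$59.9"},
    {"@key": 13, "ISBN": "…269", "title": "DB", "price": "$51.1"},
    {"@key": 20, "ISBN": "…269", "title": "DB", "price": "$59.9"},
]

title_redundant = is_redundant(books, ["ISBN"], ["title"])
price_redundant = is_redundant(books, ["ISBN"], ["price"])
```

Here `title_redundant` is true: {./ISBN} → ./title holds, yet ISBN does not identify a single book node, so the title is stored once per copy. `price_redundant` is false because the FD itself is violated.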
Strategy
• Discover satisfied XML FDs and Keys
• Data redundancies can then be
discovered based on the definition
• First, we need an efficient
representation of the XML data
Hierarchical Representation of XML Data
• Each essential tuple class is stored as a relation
Similar to nested relations [OY87,MNE96]
All relations together form a hierarchy
Tree tuples can be reconstructed by joining each tuple's parent with the @key of the parent relation

R_state
@key  parent
2     root
3     root
18    root
. . . . .

R_store
@key  parent  name
4     3       Borders
12    3       Amazon
19    18      Borders

R_book
@key  parent  ISBN   title  price
6     4       …269   DB     $59.9
13    12      …269   DB     $51.1
20    19      …269   DB     $59.9

R_au
@key  parent  @text
10    6       R.R.
11    6       J.G.
24    20      R.R.
25    20      J.G.
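To make the parent/@key join concrete, here is a minimal sketch that stores the relations shown on this slide as Python dictionaries keyed by @key and walks from a book tuple up to its state. The dictionary layout, the `HIERARCHY` list, and the function name `reconstruct` are assumptions made for illustration; only the rows visible on the slide are included.

```python
# Relations of the running example, keyed by @key.
R_state = {2: {"parent": "root"}, 3: {"parent": "root"}, 18: {"parent": "root"}}
R_store = {
    4: {"parent": 3, "name": "Borders"},
    12: {"parent": 3, "name": "Amazon"},
    19: {"parent": 18, "name": "Borders"},
}
R_book = {
    6: {"parent": 4, "ISBN": "…269", "title": "DB", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "…269", "title": "DB", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "…269", "title": "DB", "price": "$59.9"},
}

# Relations ordered from the pivot level up to the top of the hierarchy.
HIERARCHY = [("book", R_book), ("store", R_store), ("state", R_state)]

def reconstruct(book_key):
    """Join a book tuple with its ancestors by following parent -> @key."""
    flat = {}
    key = book_key
    for name, relation in HIERARCHY:
        row = relation[key]
        flat[f"{name}.@key"] = key
        for attr, value in row.items():
            if attr != "parent":
                flat[f"{name}.{attr}"] = value
        key = row["parent"]            # the join column for the next level up
    return flat

amazon_book = reconstruct(13)
```

For book 13 the join yields the book's own attributes, the store name Amazon from R_store, and state key 3 from R_state. The set-valued au children would be attached by the reverse lookup, collecting the R_au rows whose parent equals the book's @key.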
Intra-Relation FDs
• {./ISBN} → ./title, w.r.t. C_P for P = /warehouse/state/store/book
[Figure: the example XML data tree; ISBN and title are both children of book.]
Inter-Relation FDs
• {../name, ./ISBN} → ./price, w.r.t. C_P for P = /warehouse/state/store/book
[Figure: the example XML data tree; ../name is present in R_store, while ./ISBN and ./price are present in R_book.]
Overview of the Discovery Process
• Only interested in minimal FDs
• Bottom-Up
• At each relation
Discover intra-relation FDs and Keys
Discover inter-relation FDs and Keys
involving descendant relations
Generate candidate inter-relation FDs and
Keys for examination at the parent level
• Attribute Partition as the basic data
structure
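The bottom-up order over the relation hierarchy amounts to a post-order traversal. The sketch below shows only the visiting order, not the discovery work done at each relation; the `CHILDREN` map mirrors the relations of the running example, and the function name `bottom_up_order` is hypothetical.

```python
# Relation hierarchy of the running example: parent relation -> child relations.
CHILDREN = {"state": ["store"], "store": ["book"], "book": ["au"], "au": []}

def bottom_up_order(relation, children=CHILDREN):
    """Return the relations in post-order, so that every descendant relation
    is processed before the relation that contains it."""
    order = []
    for child in children[relation]:
        order.extend(bottom_up_order(child, children))
    order.append(relation)
    return order

visit_order = bottom_up_order("state")
```

Starting from state, the relations are visited as au, book, store, state, which matches the idea that candidates generated at a child level are examined at the parent level afterwards.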
Attribute Partition
• Groups tuples according to the attribute value

R_book
@key  parent  ISBN   title  price
6     4       …269   DB     $59.9
13    12      …269   DB     $51.1
20    19      …269   DB     $59.9

• ∏{price} for C_book = { {t6, t20}, {t13} }
∏{@key} for C_book = { {t6}, {t20}, {t13} }
∏{price, @key} for C_book = { {t6}, {t20}, {t13} }
• FD: LHS → RHS w.r.t. C_P is satisfied iff:
∏_{LHS∪RHS} = ∏_{LHS}
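The partition test can be written out as a short sketch. It assumes the R_book rows are held as dictionaries and represents each partition as a set of frozensets of @key values; the function names `partition` and `fd_satisfied` are hypothetical.

```python
from collections import defaultdict

R_book = [
    {"@key": 6, "parent": 4, "ISBN": "…269", "title": "DB", "price": "$59.9"},
    {"@key": 13, "parent": 12, "ISBN": "…269", "title": "DB", "price": "$51.1"},
    {"@key": 20, "parent": 19, "ISBN": "…269", "title": "DB", "price": "$59.9"},
]

def partition(rows, attrs):
    """Group tuple keys by their values on `attrs`."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[a] for a in attrs)].add(row["@key"])
    return {frozenset(group) for group in groups.values()}

def fd_satisfied(rows, lhs, rhs):
    """LHS -> RHS holds iff adding RHS does not split any LHS group."""
    return partition(rows, lhs + rhs) == partition(rows, lhs)

price_partition = partition(R_book, ["price"])
isbn_to_title = fd_satisfied(R_book, ["ISBN"], ["title"])
price_is_key = fd_satisfied(R_book, ["price"], ["@key"])
```

`price_partition` reproduces { {t6, t20}, {t13} } from the slide. Since ∏{price, @key} has three groups while ∏{price} has two, {./price} → ./@key fails and price is not a key. Because the partition on LHS∪RHS always refines the partition on LHS, comparing the number of groups would be enough; the full comparison is kept here for clarity.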
Set Attribute Partition
• Generated through refinement
Initialize ∏{au} for R_book to be { {t6, t13, t20} }
∏{@text} for R_au = { {t10, t24}, {t11, t25} }
Convert to parent: { {t6, t20}, {t6, t20} }
Refine ∏{au} using partitions in ∏{@text}
• ∏au for R_book = { {t6, t20}, {t13} }
∏au can then be used as a normal partition

R_au
@key  parent  @text
10    6       R.R.
11    6       J.G.
24    20      R.R.
25    20      J.G.

R_book
@key  parent  ISBN   title  price
6     4       …269   DB     $59.9
13    12      …269   DB     $51.1
20    19      …269   DB     $59.9
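The refinement steps of this slide can be sketched as follows. This is one reading of the worked example rather than the authors' code: each class of ∏{@text} is converted to the set of parent book keys, and every class of the current ∏{au} is split into the books that have that au value and the books that do not. The function names `refine` and `set_attribute_partition` are hypothetical.

```python
def refine(partition, splitter):
    """Split each class into the members inside and outside `splitter`."""
    refined = []
    for cls in partition:
        inside, outside = cls & splitter, cls - splitter
        refined.extend(part for part in (inside, outside) if part)
    return refined

def set_attribute_partition(parent_keys, child_partition, parent_of):
    """Partition parent tuples by the set of values their children carry."""
    partition = [frozenset(parent_keys)]           # start with a single class
    for child_class in child_partition:
        # Convert to parent: which parent tuples own a child with this value?
        parents = frozenset(parent_of[child] for child in child_class)
        partition = refine(partition, parents)
    return set(partition)

# R_au from the running example: child @key -> parent @key.
parent_of = {10: 6, 11: 6, 24: 20, 25: 20}
text_partition = [{10, 24}, {11, 25}]              # ∏{@text} for R_au

au_partition = set_attribute_partition({6, 13, 20}, text_partition, parent_of)
```

Both classes of ∏{@text} convert to the parent set {6, 20}, so the single initial class splits once into {t6, t20} and {t13}. Two books end up in the same class exactly when they agree, for every au value, on whether they have it, which is set equality.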
Discovery Algorithms
• DiscoverFD:
Discover intra-relation FDs and Keys
Similar to existing relational algorithms
• DiscoverXFD:
Discover inter-relation FDs and Keys
Key component: candidate inter-relation XML FD generation
Generating Candidate Inter-Relation FDs
• Let P' be a parent relation of P
• Parent satisfaction property
For LHS∪X → RHS w.r.t. C_P to hold for any attribute set X in relation P', LHS∪{./parent} → RHS w.r.t. C_P must hold
• Child implication property
For LHS∪X → RHS w.r.t. C_P to be a non-trivial FD for any attribute set X in relation P', LHS → RHS w.r.t. C_P must not hold
• An FD is a candidate inter-relation FD if it
satisfies both properties
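Both properties can be tested on the child relation alone. The sketch below is an illustration with hypothetical helper names (`fd_holds`, `is_candidate`), run on the R_book rows of the running example. The intuition it relies on is that ./parent identifies the parent tuple and therefore determines every attribute set X of the parent relation.

```python
def fd_holds(rows, lhs, rhs):
    """LHS -> RHS holds if each LHS value is seen with only one RHS value."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        if seen.setdefault(left, right) != right:
            return False
    return True

def is_candidate(rows, lhs, rhs):
    """LHS -> RHS is worth extending with parent-level attributes only if
    it fails on its own but holds once ./parent is added to the LHS."""
    parent_satisfaction = fd_holds(rows, lhs + ["parent"], rhs)
    child_implication = not fd_holds(rows, lhs, rhs)
    return parent_satisfaction and child_implication

R_book = [
    {"@key": 6, "parent": 4, "ISBN": "…269", "title": "DB", "price": "$59.9"},
    {"@key": 13, "parent": 12, "ISBN": "…269", "title": "DB", "price": "$51.1"},
    {"@key": 20, "parent": 19, "ISBN": "…269", "title": "DB", "price": "$59.9"},
]

price_candidate = is_candidate(R_book, ["ISBN"], ["price"])
title_candidate = is_candidate(R_book, ["ISBN"], ["title"])
```

{./ISBN} → ./price is a candidate: it fails within R_book but holds once ./parent is added, so it is passed up to the store level, where {../name, ./ISBN} → ./price can be examined. {./ISBN} → ./title is not a candidate because it already holds as an intra-relation FD, and any extension of it would be non-minimal.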
Real Datasets
• DBLP contains a fair amount of redundancy, as noted earlier in [AL04] as well
• ~10% redundancies in PIR (measured as # of redundant elements over total # of elements); schema modification reported to PIR
Scalability on XMark
• Linear in terms of scale factor (# of elements), even though exponential in theory
• Orders of magnitude faster than direct application of a state-of-the-art relational discovery algorithm
The latter takes over 3 hours to run on XMark scale factor 1
Related Work
• XML Integrity Constraints (FDs and
Keys)
[BDF+01], [LLL02], [FS03]
• XML Normal Form
[AL04], [VLL04]
• Nested Relation Normal Form
[OY87], [MNE96]
• Relational FD discovery
FUN, Dep-Miner, TANE, fdep, FastFDs
Conclusion
• A comprehensive notion of XML FDs and
Keys, capturing set semantics
• A system for detecting XML data
redundancies through the discovery of
FDs and Keys
• The system is practical for real datasets
and outperforms direct application of
the best available relational algorithm
by orders of magnitude.
Questions?