Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce
HyeongSik Kim, Padmashree Ravindra, Kemafor Anyanwu
{hkim22, pravind2, kogan}@ncsu.edu
COUL – Semantic COmpUting research Lab
Outline
 Background
 RDF Graph Pattern Matching
 Graph Pattern Matching on MapReduce
 Queries with Repeated Properties (QRP)
 Nested Triplegroup Algebra (NTGA)
 Challenges: Processing QRP with NTGA
 Approach: TripleGroup Cloning
 Well-formed, Ambiguous, and Perfect TripleGroups
 TripleGroup Cloning in TG_GroupFilter
 Evaluation
 Related Work
The Growing Amount of RDF data
 The amount of RDF on the web is rapidly growing.
 Example: DBPedia (http://dbpedia.org)
 A dataset extracted from Wikipedia.
 Contains 1 billion RDF triples.
 Linked Data on the web:
- May 2007: 12 datasets
- Sep 2011: 295 datasets
- Number of RDF triples: currently 31 billion
RDF Data Model
(Resource Description Framework)
 How is knowledge represented in the Semantic Web?
 e.g., information about mobile device products.
 The Resource Description Framework (RDF) is used.
 W3C standard data model for the Semantic Web.
 Information is represented in the form of triples.
 Ex. “product1 has a name called iphone4” as RDF:
 Subject: “product1”
 Property: “name”
 Object: “iphone4”
(:Product1, :name, :iphone4)
 The data model is a directed labeled graph.
 Node: subject, object
 Labeled edge: property
[Figure: example RDF graph over :Product1, :Product2, and :Producer1 with edges :name, :price, :date, :design, and :homepage, and values “iphone4”, “iphone5”, $499, and www.apple.com.]
Processing RDF Query
(from the Viewpoint of Graph Pattern Matching)
1. Example RDF Dataset:
[Figure: example RDF graph on mobile devices. Resources :Product1, :Product2, :Producer1; edges :name, :price, :date, :design, :homepage; literals “iphone4”, “iphone5”, $499, “2011-10-14”, “2012-09-12”, www.apple.com. Oval: resources in the Web. Rectangle: literals.]
2. Example RDF Query:
SELECT * WHERE{
?product :name ?productName .
?product :price ?productPrice .
?product :date ?productDate .
}
 The (three) triple patterns form a graph pattern.
 A query variable is denoted with a question mark (e.g., ?product).
 The query is a star pattern whose subject variable is ?product.
Processing RDF Query
(based on Relational Algebra)
1. Example RDF Dataset
Relation R
Subject     Property   Object
:Product1   :price     “$499”
:Product1   :name      “iphone 4”
:Product1   :date      “2011-10-14”
…           …          …
:Product2   :name      “iphone 5”
:Product2   :date      “2012-09-12”
…           …          …
2. Example RDF Query:
SELECT * WHERE{
?product :name ?productName .
?product :price ?productPrice .
?product :date ?productDate .
}
3. Conceptual Execution Plan
σ(Property = ":name")(R) ⋈(Subject = Subject) σ(Property = ":price")(R) ⋈(Subject = Subject) σ(Property = ":date")(R)
 Each selection is a separate scan of relation R (first, second, and third scan).
 Implicit joins on ?product
4. (Intermediate) Result:
(:Product1, :name, “iphone 4”, :Product1, :price, “$499”, :Product1, :date, “2011-10-14”)
(:Product2, :name, “iphone 5”)
Overview of MapReduce
 MapReduce (MR): a large-scale data processing system running on a cluster of machines. [DEAN08]
 Tasks are encoded as low-level map/reduce functions, which are executed in parallel across the cluster.
1. Map(k1, v1) → list(k2, v2)
2. Reduce(k2, list(v2)) → list(v3)
[Figure: MapReduce data flow. Mappers M1 to M3 read input from HDFS, sort (↓) their output, and write partitions b_{1,1} … b_{3,2} to local disk; reducers R1 and R2 merge (☼) the partitions and write results to HDFS.] [NYKIEL10]
Cost of an MR job J:
Map(J) = T_{read}(J) + T_{map}(J) + T_{sort-map}(J)
Reduce(J) = T_{sort-red}(J) + T_{reduce}(J) + T_{write}(J)
 T_{read}(J): read the data.
 T_{map}(J): execute the user’s map function.
 T_{sort-map}(J): sort and write intermediate data.
 T_{sort-red}(J): sort and merge input.
 T_{reduce}(J): execute the user’s reduce function.
 T_{write}(J): transfer the result to HDFS.
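The per-job cost terms can be turned into a small calculator. The sketch below is illustrative only: the class name JobTimes, its field names, and the function names are assumptions introduced here, and the time unit is arbitrary.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class JobTimes:
    """Hypothetical container for the six cost terms of one MR job."""
    t_read: float
    t_map: float
    t_sort_map: float
    t_sort_red: float
    t_reduce: float
    t_write: float


def map_cost(j: JobTimes) -> float:
    # Map(J) = T_read(J) + T_map(J) + T_sort-map(J)
    return j.t_read + j.t_map + j.t_sort_map


def reduce_cost(j: JobTimes) -> float:
    # Reduce(J) = T_sort-red(J) + T_reduce(J) + T_write(J)
    return j.t_sort_red + j.t_reduce + j.t_write


def workflow_cost(jobs: Iterable[JobTimes]) -> float:
    # A workflow of several MR jobs costs the sum of all map and reduce phases.
    return sum(map_cost(j) + reduce_cost(j) for j in jobs)
```

Under this model every additional MR job adds its own read and write terms, which is why plans with fewer jobs and fewer scans tend to be cheaper.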
Join Processing on MapReduce
[BLANAS10]
 Example: equi-join on the first column of relations L and R.
Input (HDFS):
L = {(k1, v5), (k2, v4)}
R = {(k2, v1), (k3, v6)}
Map:
 Extract the join column.
 Add a tag of either L or R.
 Annotate tuples with the join key.
M1: (k1, v5) → (k1, (L: k1, v5)); (k2, v1) → (k2, (R: k2, v1))
M2: (k2, v4) → (k2, (L: k2, v4)); (k3, v6) → (k3, (R: k3, v6))
Reduce:
 Separate and buffer the input records into two sets according to the table tag (L or R).
 Perform a cross-product.
R1: (k1, ((L: k1, v5)))
R2: (k2, ((L: k2, v4), (R: k2, v1))) → (k2, v4, k2, v1)
R3: (k3, ((R: k3, v6)))
Result:
(k2, v4, k2, v1)
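The map and reduce steps of this join can be simulated in a single process. This is a minimal sketch rather than Hadoop code: the function names are assumptions, and an in-memory dictionary stands in for the shuffle phase.

```python
from collections import defaultdict
from itertools import product


def map_phase(relation, tag):
    """Emit (join key, (tag, tuple)) for every tuple; the key is the first column."""
    for row in relation:
        yield row[0], (tag, row)


def reduce_phase(tagged_rows):
    """Split the buffered rows by table tag and emit their cross-product."""
    left = [row for tag, row in tagged_rows if tag == "L"]
    right = [row for tag, row in tagged_rows if tag == "R"]
    return [l + r for l, r in product(left, right)]


def repartition_join(left_rel, right_rel):
    # The dictionary plays the role of the shuffle: rows are grouped by join key.
    groups = defaultdict(list)
    for key, value in map_phase(left_rel, "L"):
        groups[key].append(value)
    for key, value in map_phase(right_rel, "R"):
        groups[key].append(value)
    result = []
    for key in sorted(groups):
        result.extend(reduce_phase(groups[key]))
    return result
```

Calling repartition_join with L = [("k1", "v5"), ("k2", "v4")] and R = [("k2", "v1"), ("k3", "v6")] yields the single joined tuple for key k2, as in the example.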
Processing Multi-Join Query on MapReduce
1. (Extended) Example Query
SELECT * WHERE{
?product :name ?productName .
?product :price ?productPrice .
?producer :design ?product .
?producer :type ?ProducerType .
}
2. Vertical Partitioning (VP): [ABADI07]
 Partition relation R vertically based on the value of the property attribute.
 E.g., the property relations name, price, design, and type can be generated using selection or split operators.
name = σ(Property = ":name")(R)
price = σ(Property = ":price")(R)
design = σ(Property = ":design")(R)
type = σ(Property = ":type")(R)
3. Corresponding Logical Plan based on VP [ABADI07]
(name ⋈(subject = subject) price) ⋈(subject = object) (design ⋈(subject = subject) type)
4. MapReduce Plans
MR Job J1: name ⋈(subject = subject) price → temp1
MR Job J2: design ⋈(subject = subject) type → temp2
MR Job J3: temp1 ⋈(subject = object) temp2 → output
Cost = Map(J1) + Reduce(J1) + Map(J2) + Reduce(J2) + Map(J3) + Reduce(J3)
Query Optimization on MapReduce
 Heuristic to group operations -> fewer MR jobs in a workflow.
 Group multiple join operations on the same key in same MR cycle. (Pig)
1. (Extended) Example Query
SELECT * WHERE{
?product :name ?productName .
?product :price ?productPrice .
?product :date ?productDate .
?producer :design ?product .
?producer :type ?ProducerType .
}
2. Corresponding Logical Plan based on VP
(name ⋈(subject = subject) price ⋈(subject = subject) date) ⋈(subject = object) (design ⋈(subject = subject) type)
MapReduce plan:
MR Job J1: name ⋈(subject = subject) price ⋈(subject = subject) date → temp1
MR Job J2: design ⋈(subject = subject) type → temp2
MR Job J3: temp1 ⋈(subject = object) temp2 → output
 Finding the optimal grouping is NP-hard; more advanced techniques use a greedy approach that groups as many non-conflicting joins as possible. [HUSAIN11]
Queries with “Repeated” Properties
1. Example Query
 Query: list the products with their detailed information, together with their producer's information (e.g., the company name, the type of company, and its foundation date).
SELECT * WHERE{
?product :name ?prodName .
?product :type ?prodType .
?product :date ?prodDate .
?product :price ?prodPrice .
?producer :design ?product .
?producer :name ?prcName .
?producer :type ?prcType .
?producer :date ?prcDate .
}
MapReduce plan based on VP (TS: TableScan (Load) operator; jobs exchange data through HDFS):
J1: TS(R) → SPLIT → name, type, date, price, design
J2: TS(price), TS(name), TS(type), TS(date) → JOIN → (price, name, …)
J3: TS(name), TS(type), TS(date), TS(design) → JOIN → (name, type, …)
J4: TS(price, name, …), TS(name, type, …) → JOIN
 Issue: name, type, and date are scanned repeatedly across MR jobs J2 and J3.
 Possible optimization considerations:
 Minimize scan overhead using indexes.
 MapReduce does not support indexes by default.
 Buffer such relations across multiple joins (memory intensive).
 Another approach: algebraic optimization
 Rewrite queries into equivalent but less expensive queries.
General Intuition in NTGA
Nested TripleGroup Algebra (NTGA): re-interpret multiple star-joins as a grouping operation, which leads to “groups of triples” (TripleGroups) instead of n-tuples. [RAVINDRA11]
1. Example Query
SELECT * WHERE {
1: ?x :p1 ?o1 .
2: ?x :p2 ?o2 .
3: ?y :p3 ?o2 .
4: ?y :p4 ?o3 .
}
2. Input Triples
Subject   Property   Object
:s1       :p1        :o1
:s1       :p2        :o2
:s2       :p3        :o3
:s2       :p4        :o4
…         …          …
VP: star-joins produce n-tuples
p1 ⋈(subject=subject) p2: t1 = (:s1, :p1, :o1, :s1, :p2, :o2)
p3 ⋈(subject=subject) p4: t2 = (:s2, :p3, :o3, :s2, :p4, :o4)
NTGA: grouping produces TripleGroups
tg1 = {(:s1, :p1, :o1), (:s1, :p2, :o2)}
tg2 = {(:s2, :p3, :o3), (:s2, :p4, :o4)}
 t1 and tg1 (likewise t2 and tg2) have a different structure BUT are “content equivalent”.
 VP: 1 MR job for each star pattern → 2 MR jobs!
 One MR job each for the star patterns whose subject variables are ?x and ?y.
 NTGA: 1 MR job for all star patterns!
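The grouping idea can be illustrated in a few lines. This is a sketch under the assumption that a TripleGroup is simply the list of triples sharing a subject; the function name tg_group_by mirrors the operator name but is not the authors' implementation.

```python
from collections import defaultdict


def tg_group_by(triples):
    """Group (subject, property, object) triples by subject in one pass."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[s].append((s, p, o))
    return dict(groups)


triples = [
    (":s1", ":p1", ":o1"),
    (":s1", ":p2", ":o2"),
    (":s2", ":p3", ":o3"),
    (":s2", ":p4", ":o4"),
]
triplegroups = tg_group_by(triples)
```

A single pass over the input yields both tg1 (subject :s1) and tg2 (subject :s2), which is the intuition behind computing all star patterns in one MR job.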
Processing RDF Query with NTGA
1. Example Query
SELECT * WHERE{
?product :name ?prodName .
?product :type ?prodType .
?product :date ?prodDate .
?product :price ?prodPrice .
?producer :design ?product .
?producer :name ?prcName .
?producer :type ?prcType .
?producer :date ?prcDate .
}
VP: 4 MR jobs (4 HDFS reads)
J1: TS(R) → SPLIT → name, type, date, price, design
J2: TS(price), TS(name), TS(type), TS(date) → JOIN → (price, name, …)
J3: TS(name), TS(type), TS(date), TS(design) → JOIN → (name, type, …)
J4: TS(price, name, …), TS(name, type, …) → JOIN
NTGA: 2 MR jobs (2 HDFS reads)
J1: TS(R) → TG_GroupBy → TG_GroupFilter → HDFS, producing TripleGroups over (:name, :type, :date, :price) and (:design, :name, :type, :date)
J2: TS (Rpltd), TS (Rltds), TG_JOIN, TG_Unnest, TG_Flatten
TS: TableScan (Load) operator
A "Key" NTGA Operator: TG_GroupFilter.
 Retain only TripleGroups that satisfy the required query substructure.
 Check for an “exact” match between the set of properties in a star pattern and the set of properties in a TripleGroup.
 Example Query:
SELECT * WHERE {
1: ?x :p1 :o1 .
2: ?x :p2 ?y .
3: ?y :p3 :o2 .
4: ?y :p4 :o3 .
}
 Input TripleGroups:
tg1 = {(:s1, :p1, :o1), (:s1, :p2, :o2)}
tg2 = {(:s2, :p2, :o2), (:s2, :p3, :o3)}
 Matching (= : matched, ≠ : not matched):
tg1: (:p1, :p2) = (:p1, :p2). Correct match; therefore, tg1 passes.
tg2: (:p2, :p3) ≠ (:p1, :p2) and (:p2, :p3) ≠ (:p3, :p4). No matches; therefore, tg2 is filtered out.
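The exact-match check can be sketched as a comparison of property sets. The code assumes TripleGroups are lists of (subject, property, object) tuples and star patterns are given as collections of property names; the function name is an illustrative stand-in for the operator, not its actual implementation.

```python
def tg_group_filter(triplegroups, star_patterns):
    """Keep only TripleGroups whose property set equals that of some star pattern."""
    wanted = {frozenset(stp) for stp in star_patterns}
    return [
        tg for tg in triplegroups
        if frozenset(p for _, p, _ in tg) in wanted
    ]


star_patterns = [(":p1", ":p2"), (":p3", ":p4")]
tg1 = [(":s1", ":p1", ":o1"), (":s1", ":p2", ":o2")]
tg2 = [(":s2", ":p2", ":o2"), (":s2", ":p3", ":o3")]
kept = tg_group_filter([tg1, tg2], star_patterns)
```

Here tg1 is kept because its properties equal (:p1, :p2), while tg2 is dropped because (:p2, :p3) equals neither star pattern.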
TG_GroupFilter Semantics and
Repeated Properties.
 TG_GroupFilter assumes a 1-1 correspondence between TripleGroups and star subpatterns.
 But with repeated properties there can be ambiguities.
1. Given triple pattern
SELECT * WHERE{
?product :name ?prodname .
?product :type ?prodType .
?product :date ?prodDate .
?product :price ?prodPrice .
?producer :design ?product .
?producer :name ?prcName .
?producer :type ?prcType .
?producer :date ?prcDate .
}
stp1: the star pattern on ?product; stp2: the star pattern on ?producer.
2. A TripleGroup from TG_GroupBy
tg0 = {(s1, :type, o1), (s1, :name, o2), (s1, :date, o3), (s1, :price, o4), (s1, :design, o5)}
 tg0 → stp1? stp2? (partial match with stp1 and stp2)
Overview of the Solution
 Issue: mappings between TripleGroups and star patterns become ambiguous if repeated properties exist across multiple star patterns.
 Goal: produce TripleGroups that are an exact match for a star pattern in the query.
 Solution: split the filtering process into two steps.
1. Remove incomplete TripleGroups that do not match any star pattern (i.e., eliminate non-well-formed TripleGroups).
2. Resolve the ambiguity of the remaining TripleGroups that may match multiple star patterns (Ambiguous TripleGroups) and generate TripleGroups that are an exact match for a star pattern (Perfect TripleGroups).
Well-formed TripleGroup
 Well-formed TripleGroup: a TripleGroup whose triples contain all the properties of some star subpattern.
1. Example Query
SELECT * WHERE{
?product :name ?prodname .
?product :date ?prodDate .
?product :price ?prodPrice .
?producer :design ?product .
?producer :name ?prcname .
?producer :date ?prcdate .
}
stp1: the star pattern on ?product; stp2: the star pattern on ?producer.
2. TripleGroups generated from TG_GroupBy
tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)}
well-formed (contains the properties from stp1)
tg2 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}
well-formed (contains the properties from stp1 and stp2)
tg3 = {(s1, :name, :o4), (s1, :design, :o3)}
NOT well-formed (does not contain all the properties from stp2)
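The well-formedness test reduces to a subset check on property sets. The sketch below assumes TripleGroups are lists of (subject, property, object) tuples and star patterns are sets of property names; the function name is_well_formed is introduced here for illustration.

```python
def is_well_formed(tg, star_patterns):
    """True if the TripleGroup contains all properties of at least one star pattern."""
    props = {p for _, p, _ in tg}
    return any(set(stp) <= props for stp in star_patterns)


stp1 = {":name", ":date", ":price"}
stp2 = {":design", ":name", ":date"}
tg1 = [("s1", ":name", ":o1"), ("s1", ":date", ":o2"), ("s1", ":price", ":o3")]
tg2 = tg1 + [("s1", ":design", ":o4")]
tg3 = [("s1", ":name", ":o4"), ("s1", ":design", ":o3")]
```

With these inputs, tg1 and tg2 are well-formed, while tg3 is not: it lacks :date for stp2 and both :date and :price for stp1.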
Ambiguous&Perfect TripleGroup
 Ambiguous TripleGroup: a well-formed TripleGroup that can be matched with multiple star subpatterns in a query, e.g., tg2.
 Perfect TripleGroup: a well-formed TripleGroup that is an exact match for a single star pattern* (a valid intermediate answer).
* a single star pattern “class”
1. Example Query
SELECT * WHERE{
?product :name ?prodname .
?product :date ?prodDate .
?product :price ?prodPrice .
?producer :design ?product .
?producer :name ?prcname .
?producer :date ?prcdate .
}
stp1: the star pattern on ?product; stp2: the star pattern on ?producer.
2. TripleGroups generated from TG_GroupBy
tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)}
Perfect TripleGroup (“exact” match with stp1)
tg2 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}
Ambiguous TripleGroup (can be matched with stp1 and stp2)
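The two definitions can be combined into a small classifier. This is a sketch: the function name, the returned labels, and in particular the label for a TripleGroup that contains exactly one star pattern plus extra properties are assumptions, since the slides do not name that case.

```python
def classify(tg, star_patterns):
    """Label a TripleGroup relative to a list of star patterns (sets of properties)."""
    props = {p for _, p, _ in tg}
    contained = [set(stp) for stp in star_patterns if set(stp) <= props]
    if not contained:
        return "not well-formed"
    if len(contained) > 1:
        return "ambiguous"
    # Exactly one star pattern is contained: perfect only on an exact match.
    if contained[0] == props:
        return "perfect"
    return "well-formed, not perfect"  # assumed label, not from the slides


stp1 = {":name", ":date", ":price"}
stp2 = {":design", ":name", ":date"}
tg1 = [("s1", ":name", ":o1"), ("s1", ":date", ":o2"), ("s1", ":price", ":o3")]
tg2 = tg1 + [("s1", ":design", ":o4")]
```

For the example query, tg1 is classified as perfect and tg2 as ambiguous.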
Dealing with Ambiguous TripleGroups
The perfect TripleGroups tg1 and tg2 are cloned from the ambiguous TripleGroup tg0, and the non-perfect TripleGroup tg3 is rejected.
SELECT * WHERE{
?product :name ?prodname .
?product :date ?prodDate .
?product :price ?prodPrice .
?producer :design ?product .
?producer :name ?prcname .
?producer :date ?prcdate .
?seller :sell ?product .
?seller :name ?selName .
}
stp1: (:name, :date, :price); stp2: (:design, :name, :date); stp3: (:sell, :name)
Ambiguous TripleGroup:
tg0 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}
Clone for stp1 (Perfect TripleGroup):
tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)}
Clone for stp2 (Perfect TripleGroup):
tg2 = {(s1, :design, :o4), (s1, :name, :o1), (s1, :date, :o2)}
Clone for stp3 (rejected, no :sell triple):
tg3 = {(s1, :sell, ??), (s1, :name, :o1)}
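The cloning step can be sketched as a projection of the ambiguous TripleGroup onto each star pattern it fully contains. The function name, the dictionary keyed by star-pattern label, and the data layout are assumptions made for this illustration; they are not taken from the TG_GroupFilter implementation.

```python
def clone_perfect_triplegroups(tg, star_patterns):
    """For each star pattern fully contained in tg, emit a clone restricted to its properties."""
    props = {p for _, p, _ in tg}
    clones = {}
    for label, stp in star_patterns.items():
        if set(stp) <= props:
            # Keep only the triples whose property belongs to this star pattern.
            clones[label] = [t for t in tg if t[1] in stp]
    return clones


star_patterns = {
    "stp1": {":name", ":date", ":price"},
    "stp2": {":design", ":name", ":date"},
    "stp3": {":sell", ":name"},
}
tg0 = [
    ("s1", ":name", ":o1"),
    ("s1", ":date", ":o2"),
    ("s1", ":price", ":o3"),
    ("s1", ":design", ":o4"),
]
clones = clone_perfect_triplegroups(tg0, star_patterns)
```

In this run, clones are produced for stp1 and stp2 only; stp3 is skipped because tg0 has no :sell triple, which corresponds to rejecting tg3.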
NTGA-based MapReduce Plan
 Example Query
SELECT * WHERE{
?product :name ?prodname .
?product :date ?prodDate .
?product :price ?prodPrice .
?producer :design ?product .
?producer :name ?prcname .
?producer :date ?prcdate .
?seller :sell ?product .
?seller :name ?selName .
}
 Generated MR Plan (m:op: map-side operator; r:op: reduce-side operator)
J1: Map: m:TG_GroupBy
J1: Reduce: r:TG_GroupBy, r:TG_GroupFilter* (Revised)
J2: Map: m:TG_JOIN (?o1 = ?o1)
J2: Reduce: r:TG_JOIN (…)
 Clone in TG_GroupFilter
tg0 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}
(clone) tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)}
(clone) tg2 = {(s1, :design, :o4), (s1, :name, :o1), (s1, :date, :o2)}
Losslessness of Revised TG_Groupfilter.
 Filter out non-well-formed TripleGroups.
 An incomplete TripleGroup that does not contain all the properties of any star pattern clearly does not match any star pattern in the query.
 Generate multiple Perfect TripleGroups from an ambiguous TripleGroup.
 Example Dataset
Subject   Property   Object
:s1       :name      :o1
:s1       :date      :o2
:s1       :price     :o3
:s1       :design    :o4
…         …          …
1. Relational Algebra (VP)
1) name ⋈(subject=subject) date ⋈(subject=subject) price
t1 = (:s1, :name, :o1, :s1, :date, :o2, :s1, :price, :o3)
2) design ⋈(subject=subject) name ⋈(subject=subject) date
t2 = (:s1, :design, :o4, :s1, :name, :o1, :s1, :date, :o2)
2. NTGA
tg0 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3), (s1, :design, :o4)}
(clone) tg1 = {(s1, :name, :o1), (s1, :date, :o2), (s1, :price, :o3)}
(clone) tg2 = {(s1, :design, :o4), (s1, :name, :o1), (s1, :date, :o2)}
t1 ≅ tg1, t2 ≅ tg2
∴ No valid intermediate results are destroyed, nor are spurious results introduced, by cloning.
Setup and TestBed
 Setup:
 Implement VP and NTGA on top of Apache Pig.
 10-node Hadoop clusters on NCSU’s VCL*.
 Three approaches were considered :
 1-join-per-cycle (SHARD) [ROHLOFF10]
 1-star-join-per-cycle (Pig-Def or VP)
 all-star-joins-1-cycle (NTGA)
 Evaluation of the redundant scans during star-join computations.
 Task 1a – varying the ratio of repeated properties to fixed ones.
 Task 1b – varying the selectivity of repeated properties.
 Task 2 – scaling up sub patterns with repeated properties.
 Task 3 – scalability test with varying data size
*https://vcl.ncsu.edu
Dataset
Dataset: Synthetic benchmark dataset generated using BSBM*
- From 22GB (250k Products, BSBM-250k ~86M triples)
- Up to 87GB (1M Products, BSBM-1000k ~350M triples)
 7 repeated properties:
- Across all classes e.g. type, publisher
- Only for a smaller subset of classes, e.g. name
 The size and selectivity ** of BSBM-250k :
:publisher - 1.7GB, 0.091
:type - 1.8GB, 0.105
:name - 49MB, 0.003
:date - 1.4GB, 0.091
* http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/
** Selectivity(P) = |T_P| / |T|, where T_P denotes the triples containing P and T denotes all triples
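The selectivity definition can be computed directly from a list of triples. A minimal sketch, assuming triples are (subject, property, object) tuples; the function name and the toy data are illustrative and are not drawn from BSBM.

```python
def selectivity(triples, prop):
    """Selectivity(P) = |T_P| / |T| for property P over all triples T."""
    if not triples:
        raise ValueError("selectivity is undefined for an empty dataset")
    matching = sum(1 for _, p, _ in triples if p == prop)
    return matching / len(triples)


toy_triples = [
    (":s1", ":name", ":o1"),
    (":s1", ":type", ":o2"),
    (":s2", ":type", ":o2"),
    (":s2", ":date", ":o3"),
]
```

On the toy data, :type accounts for two of the four triples and :name for one.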
Task 1a: Varying the Ratio of Repeated
Properties to Fixed ones.
 Test Queries (dq0 to dq4)
- Two star patterns with a fixed subset of unique properties + a varying number of repeated properties in the second star pattern (from 0 to 4).
- The overall number of triple patterns increases from 8 to 12.
[Figure: query graphs dq0 to dq4. Black edge: arbitrary unique property. Red edge: repeated property.]
- dq0: 2 star patterns, 0 repeated properties.
- dq1: 1 repeated property.
- dq2: 2 repeated properties.
- dq3: 3 repeated properties.
- dq4: 2 star patterns, 4 repeated properties (:type, :publisher, :name, :date).
Task 1a: Varying the Ratio of Repeated
Properties to Fixed ones.
[Figure: HDFS_READ (GB), HDFS_WRITE (GB), and execution time (seconds) for dq0 to dq4 under 1-join-per-cycle (SHARD), 1-star-join-per-cycle (Pig-Def), and all-star-joins-1-cycle (NTGA), with a per-MR-cycle time breakdown: Pig-Def (4 MR cycles), NTGA (2 cycles), SHARD (13 cycles).]
With an increasing number of repeated properties:
1. NTGA: constant HDFS reads and execution time; fewer HDFS writes due to the smaller number of required MR jobs.
2. SHARD: the number of scans of the whole relation increases.
3. Pig-Def or VP: the number of scans of the property relations increases.
Task 1b: Varying the Size of Repeated Props
 Test Queries: rq1 and rq2
 Identical queries with two star subpatterns, each containing a different repeated property.
- rq1: :publisher (1.7GB, 9.1%)
- rq2: :name (49MB, 0.3%)
[Figure: query graphs. rq1: two star patterns with repeated property :publisher. rq2: two star patterns with repeated property :name.]
- NTGA has around a 42% performance gain over Pig-Def for rq2, which increases to around 48% for rq1.
- With rq2, Pig-Def always takes an additional 70 seconds compared to rq1.
Task 2: Scaling up Sub patterns with Repeated
Properties
 Four queries (mq1 to mq4)
- Two repeated properties occur in each of the star subpatterns.
- The number of star patterns varies from 1 to 4.
- The total number of repeated properties across a graph pattern query increases from 2 (in mq1) to 8 (in mq4).
[Figure: query graphs. mq1: a single star pattern; mq2: two star patterns; mq3: three star patterns. Each star pattern contains :type and :publisher.]
Task 2: Scaling up Sub patterns with Repeated
Properties
[Figure: HDFS_READ (GB) and execution time (seconds) for mq1 to mq4 under 1-join-per-cycle (SHARD), 1-star-join-per-cycle (Pig-Def), and all-star-joins-1-cycle (NTGA); annotations ≈40G, ≈80G, and ≈120G.]
 mq1 to mq4: ↑ #star patterns → ↑ #repeated properties across star patterns (from 2 to 8), ↑ amount of scan-sharing across star patterns (from around 40G to 120G).
 Execution time increases due to the join operations that connect the sub-stars.
Task 3: Varying Size of Graphs
[Figure: execution time (in seconds) of Pig-Def and NTGA on BSBM-250k (22GB), BSBM-500k (43GB), BSBM-750k (66GB), and BSBM-1000k (86GB); annotated performance gains: 58%, 55%, 52.8%, 54.8%.]
 The number of RDF triples is increased for query dq4 used in Task 1.
 From BSBM-250k (22GB) to BSBM-1000k (86GB)
 The NTGA approach scales well.
- A performance gain of 52% to 58% is observed.
- The size of the relations containing repeated properties does not increase linearly with the size of the data.
Related Work
RDF Data Processing on MapReduce:
SHARD[Rohloff10] :
 The clause-iteration algorithm (n +1 jobs to process n triple patterns)
HadoopDB[Huang11] :
 A hybrid architecture of database (RDF-3x) and Hadoop with a graph partitioning scheme.
HadoopRDF[Husain10] :
 A customized storage format and plan generation based on a heuristic greedy approach.
Work Sharing on MapReduce:
MRShare [NYKIEL10]:
 Inter-query sharing scheme customized into the MapReduce framework.
NOVA [Olston11]:
 Share the initial load operation if multiple copies of workflow use the identical input.
CoScan[Wang11]:
 Minimize redundant data loading by merging multiple Pig scripts.
Relevant Publications
 Kim, H., Ravindra, P., Anyanwu, K.: Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce. In: Proc. CLOUD (2012)
 Anyanwu, K., Kim, H., Ravindra, P.: Algebraic Optimization for Processing Graph Pattern Queries in the Cloud. IEEE Internet Computing (2012)
 Kim, H., Ravindra, P., Anyanwu, K.: From SPARQL to MapReduce: The Journey using a Nested TripleGroup Algebra. In: Proc. International Conference on Very Large Data Bases (2011) (Demonstration)
 Ravindra, P., Kim, H., Anyanwu, K.: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Platforms. In: Proc. Extended Semantic Web Conference (2011)
References
[DEAN08] Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51 (2008)
107–113
[OLSTON08] Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data
processing. In: Proc. International Conference on Management of data. (2008)
[HUSAIN11] M. F. Husain, J. McGlothlin et al., “Heuristics-Based Query Processing for Large RDF Graphs Using Cloud
Computing,” TKDE, vol. 23, pp. 1312–1327, 2011.
[HUANG11] J. Huang, D. J. Abadi et al., “Scalable SPARQL Querying of Large RDF Graphs,” Proc. VLDB, vol. 4, no. 11,
2011.
[NYKIEL10] T. Nykiel, M. Potamias et al., “MRShare: Sharing across Multiple Queries in MapReduce,” Proc. VLDB, vol.
3, pp.494–505, 2010.
[OLSTON11] C. Olston, G. Chiou et al., “Nova: Continuous Pig/Hadoop Workflows,” in Proc. SIGMOD, 2011, pp. 1081–
1090.
[WANG11] X. Wang, C. Olston et al., “CoScan: Cooperative Scan Sharing in the Cloud,” in Proc. SOCC, 2011, pp. 11:1–
11:12.
[RAVINDRA11] P. Ravindra, H. Kim et al., “An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on
MapReduce,” in Proc. ESWC, 2011, vol. 6644, pp. 46–61.
[ABADI07] D. J. Abadi, A. Marcus et al., “Scalable Semantic Web data Management using Vertical Partitioning,” in Proc.
VLDB,2007.
[ROHLOFF10] K. Rohloff and R. E. Schantz, “High-performance, Massively Scalable Distributed Systems using the
MapReduce Software Framework: the SHARD Triple-store,” in PSI EtA, 2010, pp. 4:1–4:5.
[NEUMANN10] T. Neumann and G. Weikum, “The RDF-3X engine for scalable management of RDF data,” The VLDB
Journal, vol. 19, pp. 91–113, 2010.
[WEISS08] C. Weiss, P. Karras, and A. Bernstein.“Hexastore: Sextuple Indexing for Semantic Web Data Management”,
Proc. VLDB, vol. 1, no. 1, 2008.
[HERODOTOU11] H. Herodotou and S. Babu. “Profiling, What-if Analysis, and Cost-based Optimization of MapReduce
Programs.” Proc. VLDB, vol. 4, 2011
[BLANAS10] S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. “A Comparison of Join Algorithms for Log Processing in MapReduce.” Proc. SIGMOD, 2010.
Thank You!
RDF Data Model
(Resource Description Framework)
1. Statements (triples)
Subject      Property     Object
:Product1    :name        “iphone4”
:Product1    :color       :white
:Product1    :date        “2011-10-14”
:Product1    :publisher   :Producer1
…            …            …
:Producer1   :name        “Apple”
:Producer1   :type        :Producer
:Producer1   :date        “1976-04-01”
:Producer1   :homepage    apple.com
2. Graph Representation
[Figure: graph of the statements above. Oval: resources, i.e., URIs. Rectangle: literals.]
 Star subgraphs: sets of edges with the same subject, e.g., :Product1 and :Producer1.
Relationship between TripleGroups and
n-tuples
 TripleGroups are not structurally equivalent to n-tuples but are “content equivalent”.
1. TripleGroup in NTGA (TG_GroupBy and TG_GroupFilter)
tg1 = {(:Product1, :type, :Product),
(:Product1, :date, “1976-04-01”),
(:Product1, :name, “iphone 4”)}
2. n-tuple in VP (SPLIT and JOIN)
(:Product1, :type, :Product, :Product1, :date, “1976-04-01”, :Product1, :name, “iphone 4”)
(t1, t2, and t3 mark the three triples within the n-tuple)
NTGA Quick Reference
Consider a set of TripleGroups TG = {tg1, tg2} such that
tg1 = {(:Prdct1, :name, “iphone4”),
(:Prdct1, :publisher, :Prdcr1),
(:Prdct1, :price, “100”)}
tg2 = {(:Prdcr1, :type, :Prdcr),
(:Prdcr1, :date, “1976-04-01”),
(:Prdcr1, :hpage, “apple.com”)}
# | NTGA Operator | Result
1 | TG_Flatten(tg1)
(:Prdct1, :name, “iphone4”, :Prdct1, :publisher, :Prdcr1, :Prdct1, :price, “100”)
2 | TG_Join(?o :publisher ?v : TG{:name, :publisher, :price}, ?v :type ?t : TG{:type, :date, :hpage})
ntg = { (:Prdct1, :name, “iphone4”),
(:Prdct1, :publisher, {(:Prdcr1, :type, :Prdcr),
(:Prdcr1, :date, “1976-04-01”),
(:Prdcr1, :hpage, “apple.com”)}),
(:Prdct1, :price, “100”) }
3 | TG_Unnest(ntg)
{ (:Prdct1, :name, “iphone4”),
(:Prdct1, :publisher, :Prdcr1),
(:Prdcr1, :type, :Prdcr),
(:Prdcr1, :date, “1976-04-01”),
(:Prdcr1, :hpage, “apple.com”),
(:Prdct1, :price, “100”) }
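The flatten and unnest results in this table can be reproduced with a short sketch. It assumes a TripleGroup is a list of (subject, property, object) tuples and that a nested TripleGroup is a non-empty list sitting in the object position; the function names mirror the operator names but are illustrative, not the NTGA implementation.

```python
def tg_flatten(tg):
    """Concatenate the triples of a TripleGroup into one n-tuple."""
    return tuple(value for triple in tg for value in triple)


def tg_unnest(ntg):
    """Replace each nested TripleGroup by a link triple followed by its own triples."""
    flat = []
    for s, p, o in ntg:
        if isinstance(o, list):
            # Assumes the nested group is non-empty and shares one subject.
            flat.append((s, p, o[0][0]))
            flat.extend(tg_unnest(o))
        else:
            flat.append((s, p, o))
    return flat


tg2 = [
    (":Prdcr1", ":type", ":Prdcr"),
    (":Prdcr1", ":date", "1976-04-01"),
    (":Prdcr1", ":hpage", "apple.com"),
]
ntg = [
    (":Prdct1", ":name", "iphone4"),
    (":Prdct1", ":publisher", tg2),
    (":Prdct1", ":price", "100"),
]
```

Unnesting ntg produces six flat triples in the order listed in row 3, with the :publisher triple pointing at :Prdcr1.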
Execution on MapReduce Platform
 MapReduce (MR): a popular large-scale data processing system running on a cluster of commodity-grade machines. [DEAN08]
 Tasks are encoded as low-level map/reduce functions, which are executed in parallel across the cluster.
 Apache Hadoop*: an open-source implementation.
 Extended systems provide high-level languages for specifying tasks, along with optimizing compilers for generating map/reduce code à la database systems.
 Pig Latin for Apache Pig**, HiveQL for Apache Hive***.
* http://hadoop.apache.org
Architecture of RAPID+
[Figure: RAPID+ architecture, top to bottom]
Query
Parser Layer: Pig Latin parser, SPARQL parser, (…)
Logical Plan Generator/Optimizer: Pig Latin Plan Generator, Query Analyzer, NTGA Plan Generator
Plan operators shown: LOAD, SPLIT, JOIN, STORE, TG_GroupBy, TG_GroupFilter, TG_Join
Logical-to-Physical Plan Translator
MapReduce Job Compiler
Hadoop Job Tracker
