
Business Intelligence
on Complex Graph Data
Dritan Bleco
([email protected])
Yannis Kotidis
([email protected])
Department of Informatics
Athens University Of Economics and Business
BEWEB 2012
Berlin
Outline
• Motivation
• Graph Data Model
• Operators on Graph Data
• Querying Graph Records
• Query Rewrites
• Experiments
• Conclusions
Motivational Example
• A Supply Chain Management (SCM) application
• Tracks the different routes that the articles of a customer
order follow from the production lines to the consumer's
hands
• Multiple warehouses are located between the production
lines and the shipping points and can stage the products
while the order is being assembled
• RFID readers are used to keep track of the location of the
articles
• An order follows one or more paths, so our web supply
chain application produces graph-like data
[Figure: graph record of one order. Nodes A to H are arranged in three layers (Production Lines, Warehouses, Shipping Points); every edge is labeled with its measure, and nodes C and E carry the internal measures C:1 and E:8.]

Start Node   End Node   Measure
A            B          15
A            C          11
C            C          1
C            E          7
B            E          5
B            D          9
E            E          8
E            D          4
E            F          2
C            G          43
D            H          6
F            G          9
F            H          4

Node   Location
A      Thessaloniki
C      Trikala
B      Lamia
D      Athens
E      Athens
F      Athens
G      Iraklion
H      Kalamata
[Figure: the example graph record and the Node/Location table, repeated.]
Q1: What is the total order completion time?
The longest path between node A and nodes G, H.
Q2: What is the total processing time for parts
that are shipped through warehouses located
in Athens?
The longest path between node A and nodes G, H,
considering only paths that traverse at least one location in
Athens.
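A minimal sketch of how Q1 and Q2 could be evaluated directly on the example record. It assumes that a path's time is the sum of its edge measures plus the internal measures of the nodes it visits, and that "located in Athens" means nodes D, E and F from the Node/Location table. The names EDGES, INTERNAL, ATHENS, paths and total_time are illustrative and do not come from the slides.

```python
# Edge measures of the example record (Start Node, End Node, Measure).
EDGES = {
    ("A", "B"): 15, ("A", "C"): 11, ("C", "E"): 7, ("B", "E"): 5,
    ("B", "D"): 9, ("E", "D"): 4, ("E", "F"): 2, ("C", "G"): 43,
    ("D", "H"): 6, ("F", "G"): 9, ("F", "H"): 4,
}
INTERNAL = {"C": 1, "E": 8}  # node-internal measures (C:1, E:8)
ATHENS = {"D", "E", "F"}     # nodes whose location is Athens


def paths(start, targets):
    """Return every path from start to a target node (the record is acyclic)."""
    done, stack = [], [[start]]
    while stack:
        path = stack.pop()
        if path[-1] in targets:
            done.append(path)
            continue
        stack.extend(path + [v] for (u, v) in EDGES if u == path[-1])
    return done


def total_time(path):
    """Edge measures plus the internal measures of the nodes on the path."""
    edge_sum = sum(EDGES[(u, v)] for u, v in zip(path, path[1:]))
    return edge_sum + sum(INTERNAL.get(n, 0) for n in path)


all_paths = paths("A", {"G", "H"})
# Q1: longest path from A to G or H.
q1 = max(all_paths, key=total_time)
# Q2: same, restricted to paths that visit at least one Athens node.
q2 = max((p for p in all_paths if ATHENS & set(p)), key=total_time)
print("Q1:", "".join(q1), total_time(q1))  # Q1: ACG 55
print("Q2:", "".join(q2), total_time(q2))  # Q2: ABEFG 39
```

Exhaustive enumeration is only practical because the example record is tiny; it is used here to make the two questions concrete, not as an evaluation strategy.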
Aggregate Nodes
[Figure: the example graph record with the Athens warehouses D, E and F enclosed in the aggregate node U.]
Aggregate node U coalesces the warehouses located in Athens.
In(U) is the set of nodes of U that have at least one incoming edge from a
node that does not belong to U: In(U) = {D, E}
Out(U) is the set of nodes of U that have at least one outgoing edge towards a node
that does not belong to U: Out(U) = {D, F}
A single node can be abstracted as an aggregate node whose internal structure is not
revealed to the query: E = [in(E), out(E)] : 8
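The definitions of In(U) and Out(U) translate into two set comprehensions. The sketch below assumes U = {D, E, F} (the nodes located in Athens) and uses the edges of the example record; the function names in_set and out_set are illustrative.

```python
# Edges of the example record (node-internal measures omitted).
EDGES = [
    ("A", "B"), ("A", "C"), ("C", "E"), ("B", "E"), ("B", "D"), ("E", "D"),
    ("E", "F"), ("C", "G"), ("D", "H"), ("F", "G"), ("F", "H"),
]
U = {"D", "E", "F"}  # assumption: the warehouses located in Athens


def in_set(edges, group):
    """Nodes of group with an incoming edge from outside the group."""
    return {v for u, v in edges if v in group and u not in group}


def out_set(edges, group):
    """Nodes of group with an outgoing edge to a node outside the group."""
    return {u for u, v in edges if u in group and v not in group}


print(sorted(in_set(EDGES, U)))   # ['D', 'E']
print(sorted(out_set(EDGES, U)))  # ['D', 'F']
```

Applied to the single-node group {"E"}, both functions return {"E"}, which corresponds to viewing E as the aggregate node [in(E), out(E)].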
Path (AE) / Path (AE]
[Figure: the example graph record with the path A, C, E highlighted.]
Different simple paths:
(ACE) starts from out(A) and ends at in(E) (internal measure 8 is not included)
(ACE] starts from out(A) and ends at out(E) (internal measure 8 is included)
Node E - Path [EE]
[Figure: the example graph record with node E highlighted.]
[EE] starts from in(E) and ends at out(E): [in(E), out(E)] = E
Composite Path [AE]*
[Figure: the example graph record with the paths A, C, E and A, B, E highlighted.]
Composite paths: sets of paths with the same starting and ending node
[A, E]* = { [ACE], [ABE] }
Composite Path [A, in(U))*
[Figure: the example graph record with the paths from A into the aggregate node U highlighted.]
[A, in(U))* = { [ACE), [ABE), [ABD) }
Composite Path [in(U), out(U)]*
[Figure: the example graph record with the paths inside the aggregate node U highlighted.]
[in(U), out(U)]* = { [EF], [ED] }
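A composite path can be enumerated with a small depth-first search. The sketch below is one possible reading, assuming that a path stops at the first node of the target set and that U = {D, E, F}; paths_between and its allowed parameter are hypothetical names. Open and closed end-points are not modeled here, only the node sequences.

```python
# Edges of the example record (node-internal measures omitted).
EDGES = [
    ("A", "B"), ("A", "C"), ("C", "E"), ("B", "E"), ("B", "D"), ("E", "D"),
    ("E", "F"), ("C", "G"), ("D", "H"), ("F", "G"), ("F", "H"),
]
U = {"D", "E", "F"}                 # assumption: the Athens warehouses
IN_U, OUT_U = {"D", "E"}, {"D", "F"}


def paths_between(starts, ends, allowed=None):
    """Simple paths from a start node that stop at the first node in ends.

    If allowed is given, every node after the start must belong to it.
    """
    found, stack = [], [[s] for s in sorted(starts)]
    while stack:
        path = stack.pop()
        if path[-1] in ends:
            found.append("".join(path))
            continue
        for u, v in EDGES:
            if u == path[-1] and v not in path and (allowed is None or v in allowed):
                stack.append(path + [v])
    return sorted(found)


print(paths_between({"A"}, {"E"}))          # ['ABE', 'ACE']
print(paths_between({"A"}, IN_U))           # ['ABD', 'ABE', 'ACE']
print(paths_between(IN_U, OUT_U, allowed=U))  # ['D', 'ED', 'EF']
```

The single-node result 'D' in the last call is the trivial path from in(U) to out(U) at node D, which the Query Rewrite slide lists as [DD]:0.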
Operators on Graph Data
[Figure: the example graph record with the path A, C, E, F, G highlighted.]
The path-join operator ⋈ concatenates two paths p1 and p2 when
1. the ending node of p1 is the same as the starting node of p2, and
2. one of the two paths is open-ended at the common end-point.
[ACE) ⋈ [EFG] = [ACEFG]
Operators on Graph Data
[Figure: the example graph record with the labels Pr and Sr and the aggregate node U.]
Path-join through the aggregate node U:
[Pr, in(U)) ⋈ [in(U), out(U)] ⋈ (out(U), Sr]
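The two path-join conditions can be made explicit in a small sketch. It reads condition 2 as "exactly one of the two paths is open-ended at the common node", which is an interpretation rather than something the slides state; the Path class and path_join function are illustrative names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Path:
    nodes: Tuple[str, ...]
    closed_start: bool = True  # "[" if True, "(" if False
    closed_end: bool = True    # "]" if True, ")" if False

    def __str__(self) -> str:
        left = "[" if self.closed_start else "("
        right = "]" if self.closed_end else ")"
        return f"{left}{''.join(self.nodes)}{right}"


def path_join(p1: Path, p2: Path) -> Path:
    """Concatenate p1 and p2 if they satisfy the two join conditions."""
    if p1.nodes[-1] != p2.nodes[0]:
        raise ValueError("p1 must end at the node where p2 starts")
    if p1.closed_end == p2.closed_start:
        # Assumed reading: exactly one side covers the common node.
        raise ValueError("exactly one path must be open-ended at the common node")
    return Path(p1.nodes + p2.nodes[1:], p1.closed_start, p2.closed_end)


ace_open = Path(("A", "C", "E"), closed_end=False)  # [ACE)
efg = Path(("E", "F", "G"))                         # [EFG]
print(path_join(ace_open, efg))                     # [ACEFG]

# One concrete instance of the three-way join through U: [ABE), [EF], (FG]
abe_open = Path(("A", "B", "E"), closed_end=False)
ef = Path(("E", "F"))
fg = Path(("F", "G"), closed_start=False)
print(path_join(path_join(abe_open, ef), fg))       # [ABEFG]
```

Requiring exactly one open end keeps the internal measure of the shared node from being counted twice or dropped when the joined path is later aggregated.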
Operators on Graph Data
[Figure: the example graph record with the projected paths highlighted.]
πp(r): the path projection operator projects the record r on the edges defined in
path p, while retaining their measures.
π[ACE)(r) = { (A,C):11, (C,C):1, (C,E):7 }
The projection of a record on a composite path is computed as a set
containing the projections on the constituent paths.
π[AE)*(r) = { {(A,C):11, (C,C):1, (C,E):7}, {(A,B):15, (B,E):5} }
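A minimal sketch of path projection, assuming a record is stored as a mapping from (start, end) pairs to measures, with node-internal measures kept as self-edges such as (C,C):1. The function name project and its keyword arguments are illustrative.

```python
# The example record, including the self-edges (C,C):1 and (E,E):8.
RECORD = {
    ("A", "B"): 15, ("A", "C"): 11, ("C", "C"): 1, ("C", "E"): 7,
    ("B", "E"): 5, ("B", "D"): 9, ("E", "E"): 8, ("E", "D"): 4,
    ("E", "F"): 2, ("C", "G"): 43, ("D", "H"): 6, ("F", "G"): 9,
    ("F", "H"): 4,
}


def project(record, nodes, closed_start=True, closed_end=True):
    """Return the edges of record covered by the path, with their measures."""
    wanted = list(zip(nodes, nodes[1:]))     # edges between consecutive nodes
    wanted += [(n, n) for n in nodes[1:-1]]  # internal measures of inner nodes
    if closed_start:
        wanted.append((nodes[0], nodes[0]))  # internal measure of the start node
    if closed_end:
        wanted.append((nodes[-1], nodes[-1]))  # internal measure of the end node
    return {edge: record[edge] for edge in wanted if edge in record}


# Projection on the single path [ACE)
print(project(RECORD, "ACE", closed_end=False))
# {('A', 'C'): 11, ('C', 'E'): 7, ('C', 'C'): 1}

# Projection on the composite path [AE)*: one projection per constituent path
composite = [project(RECORD, p, closed_end=False) for p in ("ACE", "ABE")]
print(composite[1])  # {('A', 'B'): 15, ('B', 'E'): 5}
```

Because the end-point is open in [ACE), the self-edge (E,E):8 is left out; projecting on [ACE] instead would include it.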
BI on Graph Data
[Figure: the example graph record annotated with the partial sums [ACE):19 and [ABE):20.]
Intra-path aggregate function Fp(r): applied on the measures resulting from the
projection of record r on path p
Sum[ACE)(r) = [ACE):19
Sum[AE)*(r) = { [ACE):19, [ABE):20 }
Inter-path aggregate function G(Fp(r)): consolidates the result(s) obtained via intra-path aggregation.
Max(Sum[AE)*(r)) = Max({ [ACE):19, [ABE):20 }) = [ABE):20
Max(Sum[Pr, Sr]*(r)) returns the order completion time for the order depicted in record r
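The two aggregation levels can be illustrated on the projection of the record on [AE)*. The sketch starts from the projected edges given on the path projection slide; intra_path and inter_path_max are illustrative names, and Max is implemented so that it returns the winning path together with its value, as in the example above.

```python
# Projection of the example record on the composite path [AE)*.
PROJECTION = {
    "[ACE)": {("A", "C"): 11, ("C", "C"): 1, ("C", "E"): 7},
    "[ABE)": {("A", "B"): 15, ("B", "E"): 5},
}


def intra_path(func, projection):
    """Apply func to the measures of each constituent path separately."""
    return {path: func(measures.values()) for path, measures in projection.items()}


def inter_path_max(per_path):
    """Consolidate the per-path results by keeping the largest one."""
    return max(per_path.items(), key=lambda item: item[1])


sums = intra_path(sum, PROJECTION)
print(sums)                  # {'[ACE)': 19, '[ABE)': 20}
print(inter_path_max(sums))  # ('[ABE)', 20)
```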
Queries using operators
[Figure: the example graph record with the labels Pr and Sr and the aggregate node U.]
Query Rewrite
[Figure: the example graph record with the labels Pr and Sr and the aggregate node U.]
MAX( SUM[Pr, in(U)) ⋈ [in(U), out(U)] ⋈ (out(U), Sr](r) )
= MAX( SUM[Pr, in(U))(r) ⋈_SUM SUM[in(U), out(U)](r) ⋈_SUM SUM(out(U), Sr](r) )
Generally, for p = p1 ⋈ p2: G( Fp(r) ) = G( Fp1(r) ⋈ Fp2(r) ), i.e. the intra-path aggregate is pushed onto the sub-paths.
Query Rewrite
[Figure: the example graph record annotated with the partial aggregates listed below.]
MAX( SUM[Pr, in(U))(r) ⋈_SUM SUM[in(U), out(U)](r) ⋈_SUM SUM(out(U), Sr](r) )
= MAX( {[ABE):20, [ACE):19, [ABD):24} ⋈_SUM {[EF]:10, [ED]:12, [DD]:0} ⋈_SUM {(FG]:9, (FH]:4, (DH]:6} )
= MAX( {[ABEFG]:39, [ACEFG]:38, [ABEFH]:34, [ACEFH]:33, [ABEDH]:38, [ACEDH]:37,
[ABDH]:30} ) = [ABEFG]:39
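The rewritten query can be replayed on the partial aggregates shown on this slide. The sketch below assumes that joining two partial results means matching the end node of the left path with the start node of the right path and adding the two sums; join_sum is an illustrative name, and brackets are dropped from the path labels for brevity.

```python
# Partial sums taken from the slide.
PREFIX = {"ABE": 20, "ACE": 19, "ABD": 24}  # SUM[Pr, in(U))(r)
INSIDE = {"EF": 10, "ED": 12, "D": 0}       # SUM[in(U), out(U)](r); "D" is [DD]
SUFFIX = {"FG": 9, "FH": 4, "DH": 6}        # SUM(out(U), Sr](r)


def join_sum(left, right):
    """Join partial results on their common node and add their measures."""
    return {
        lpath + rpath[1:]: lsum + rsum
        for lpath, lsum in left.items()
        for rpath, rsum in right.items()
        if lpath[-1] == rpath[0]
    }


full = join_sum(join_sum(PREFIX, INSIDE), SUFFIX)
best = max(full.items(), key=lambda item: item[1])
print(len(full))  # 7
print(best)       # ('ABEFG', 39)
```

The seven joined paths and their sums match the set listed on the slide. The benefit of the rewrite is that each of the three partial results can be computed, or precomputed, independently of the others.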
Experiments (I)
• Two real schema graphs:
1. BAY*: depicts San Francisco Bay Area roads, and
2. Gnutella**: describes connections among Gnutella hosts from August
2002.
• 120 million records are synthesized, and random real values are assigned
to the labels of each record.
• Experimental evaluation using the PBS (Pick By Size) algorithm
• Query workloads: 50% intra-path and 50% inter-path queries, chosen with a Zipf or uniform distribution.
• Independent evaluation of the cost via the total number of tuples
that need to be retrieved
* http://www.dis.uniroma1.it/~challenge9/download.shtml
** http://snap.stanford.edu/data/p2p-Gnutella05.html
Experiments (II)
[Figure: two charts. PBS, BAY data set, uniform, 100 queries; PBS, BAY data set, Zipf, 100 queries.]
PBS-1 considers only intra-path materialized aggregates
PBS-2 considers only inter-path materialized aggregates
PBS selects and materializes both types of views depending on the query workload
Experiments (III)
[Figure: two charts. PBS, Gnutella data set, uniform, 100 queries; PBS, Gnutella data set, Zipf, 100 queries.]
Experiments (IV)
[Figure: two charts. Varying query mix, BAY data set, uniform queries; varying query mix, BAY data set, Zipf queries.]
Mix of intra-/inter-path queries on the BAY data set for a fixed budget of 20%.
For inter-path queries, PBS and PBS-2 have the same performance.
For workloads with only intra-path queries, PBS-1 and PBS give the best performance.
PBS, which considers both types of views, consistently provides the largest reduction in
query cost.
Conclusions
• A framework for modeling analytical queries in a graph
database, independent of
  • the underlying storage representation of the records
  • the query language used
• Our framework
  • permits rewriting of complex aggregations into smaller
  computational units
  • enables cost-based query optimization and precomputation of frequently used calculations
• Experimental results show that proper selection of
materialized views can provide substantial gains in a large data
warehouse containing millions of graph records.
Thank you,
Questions?