Graph Indexing
Davide Mottin, Konstantina Lazaridou
Hasso Plattner Institute
Graph Mining course Winter Semester 2016
HPI-Kolloquium – Graph summarization
Prof. Danai Koutra
Networks naturally capture a host of real-world interactions, spanning from social interactions to brain activity. But, given a massive graph (e.g., a large email exchange network), what can be learned about its structure? What are its "important" patterns?
In this talk I will present our work on scalable algorithms that help us to make sense of large graph data during early exploratory analysis. I will focus on two approaches: "VoG" and "TimeCrunch". VoG disentangles the complex graph connectivity patterns, and efficiently summarizes large graphs with important and semantically meaningful structures by leveraging information-theoretic methods. TimeCrunch discovers coherent temporal patterns, and summarizes time-evolving networks in a scalable and effective way. Both methods provide key insights into large real-world graphs.
I will conclude the talk by presenting "Perseus", an interactive large-scale graph mining and visualization tool that can be used to obtain a different type of summaries that consist of graph properties (or statistics).
GRAPH MINING WS 2016
Lecture road
Graph Indexing
Feature-based Graph Index
Structure Similarity Search
Graph Search
Querying a set of graphs
Given a set of graphs $D = \{G_1, \dots, G_n\}$ and a query graph $Q$, find all graphs $D_Q = \{G \mid \mathit{Match}(Q, G) = 1,\ G \in D\}$, where $\mathit{Match}$ is a boolean function.
§ Match is a function:
• Subgraph isomorphism for substructure search
• Approximate subgraph match for substructure similarity search
[Figure: a query graph and a set of graphs (or graph database)]
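The search problem above can be sketched as a sequential scan that applies a Match function to every graph. This is only an illustration: the graph representation, the helper names (`make_graph`, `match`, `graph_search`) and the toy database are assumptions, and the matcher is a brute-force, non-induced subgraph isomorphism test that is exponential in the query size.

```python
from itertools import permutations

def make_graph(labels, edges):
    """Hypothetical helper: labels maps node -> label, edges lists node pairs."""
    return {"labels": dict(labels), "edges": {frozenset(e) for e in edges}}

def match(q, g):
    """True if q is subgraph-isomorphic to g (brute force over injective mappings)."""
    q_nodes = list(q["labels"])
    for image in permutations(g["labels"], len(q_nodes)):
        mapping = dict(zip(q_nodes, image))
        labels_ok = all(q["labels"][u] == g["labels"][mapping[u]] for u in q_nodes)
        edges_ok = all(
            frozenset(mapping[u] for u in e) in g["edges"] for e in q["edges"]
        )
        if labels_ok and edges_ok:
            return True
    return False

def graph_search(database, q):
    """D_Q = {G in D | Match(Q, G) = 1}, computed by scanning the whole database."""
    return {name for name, g in database.items() if match(q, g)}

# Toy database (made-up molecules): a C-C-O path, a C-N edge, a C-C-O triangle.
database = {
    "G1": make_graph({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)]),
    "G2": make_graph({1: "C", 2: "N"}, [(1, 2)]),
    "G3": make_graph({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3), (1, 3)]),
}
query = make_graph({1: "C", 2: "O"}, [(1, 2)])
result = graph_search(database, query)
```

Every query pays one isomorphism test per database graph, which is the cost that graph indexes try to avoid.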
Scalability issues
§ Large graphs or large sets of graphs must be processed
§ Structure search is hard (subgraph isomorphism is NP-complete)
§ Graph search needs to be fast
§ Requires a lot of disk I/Os
We need efficient structures that summarize the graphs and make the queries easy to process
These structures are called Graph Indexes
Indexing strategies
§ Graph embedding
• Represent each graph as a vector of features and use high-dimensional indexes such as R-trees [Petrakis et al., 1997]
• Use eigenvalues of adjacency matrices as signatures [Shokoufandeh et al., 1999]
§ Metric indexing
• Organize graphs hierarchically according to their mutual distances [Berretti et al., 2001]
• Use minimum description length to compress [Holder et al., 1994]
§ Feature-based graph index
• Represent graphs as sets of features (substructures) and index them.
Indexing strategy
If graph G contains query graph Q, G should contain any substructure of Q
[Figure: a graph (G), a query (Q), and a substructure of the query]
Index substructures of a query graph to prune graphs that do NOT contain these substructures (conservative strategy)
Graph search with indexes
§ Two steps in processing graph queries
Step 1. Index Construction
§ Enumerate structures in the graph database, build an inverted index between structures and graphs
Step 2. Query Processing
§ Enumerate structures in the query graph
§ Calculate the candidate graphs containing these structures
§ Remove the false positive answers by performing a subgraph isomorphism test
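A minimal sketch of the two steps, assuming the simplest possible feature type: the label pair of each edge. The function names and the toy database are hypothetical, and the final verification step (the subgraph isomorphism test on each candidate) is left out.

```python
from collections import defaultdict

def edge_features(graph):
    """Features are label pairs of edges (a simplification chosen for illustration)."""
    labels, edges = graph
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}

def build_index(database):
    """Step 1: inverted index from each feature to the graphs that contain it."""
    index = defaultdict(set)
    for name, graph in database.items():
        for feature in edge_features(graph):
            index[feature].add(name)
    return index

def candidates(index, database, query):
    """Step 2: keep only the graphs that contain every feature of the query."""
    result = set(database)
    for feature in edge_features(query):
        result &= index.get(feature, set())
    return result

# Toy database: each graph is (node labels, edge list).
database = {
    "G1": ({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)]),
    "G2": ({1: "C", 2: "N"}, [(1, 2)]),
    "G3": ({1: "C", 2: "O", 3: "N"}, [(1, 2), (1, 3)]),
}
query = ({1: "O", 2: "C", 3: "N"}, [(1, 2), (2, 3)])  # the path O-C-N

index = build_index(database)
candidate_set = candidates(index, database, query)
```

The candidate set may still contain false positives, so each candidate would then be fetched and verified against the query.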
Graph search with indexes (2)
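(1) Search
Candidate set: the set of graphs containing all the features of the query
$C_Q = \bigcap_{f \sqsubseteq Q,\ f \in F} D_f$
- $F$ is the set of features
(2) Fetching
Retrieve the candidate graphs from the disk
(3) Verification
Check if the candidates satisfy the query
[Figure: index construction maps the database $D = \{G_1, \dots, G_n\}$ to an index in which each node is a feature (f1, f2, ...) with its list of graphs ({G1, G2, ..}, {G4, ..}); query processing looks up the features of the query (Q) in this index]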
Cost Analysis
§ Query response time
$T_{index} + |C_Q| \cdot (T_{io} + T_{isomorphism\_testing})$
• $T_{index}$: time to retrieve the index into main memory
• $|C_Q|$: number of candidate graphs
• $T_{io}$: disk I/O time
• $T_{isomorphism\_testing}$: isomorphism testing time
Remarks
• Time for the isomorphism test does not change much with queries
• Indexing time is important if the index does not fit into memory
• Make $|C_Q|$ as small as possible
Indexes try to reduce this set as much as possible
Types of features
Different approaches use different types of features:
1. Paths
Easy to compute and to manipulate
Generate many false positive candidates
2. Trees
Easier to manipulate than subgraphs, and more efficient
Generate more false positives than subgraphs
3. Subgraphs
Generate fewer candidates
Complex structures, generate bigger indexes
Path-based Approaches
[James et al. 2003, Shasha et al. 2002]
1. Enumerate paths of a specific length
0-length: C, O, N, S
1-length: C-C, C-O, C-N, N-N, S-O
2-length: C-C-C, C-O-C, C-N-C, …
3-length: ...
2. Build an inverted index between paths and graphs
$S_C = \{a, b, c\}$, $S_O = \{a, b, c\}$
$S_{C-C} = \{a, b, c\}$, $S_{C-N} = \{a, b, c\}$, …
$S_{C-C-C} = \{a, b\}$, …
Path-based Approaches (2)
§ Given a query
§ Decompose it into paths and compute the intersection among the candidate sets
• $S_C = \{a, b, c\}$, $S_N = \{a, b, c\}$
• $S_{C-C} = \{a, b, c\}$, $S_{C-N} = \{a, b, c\}$, …
• $S_{C-N-C} = \{a, b\}$, …
§ The intersection is {a, b}
§ Retrieve graphs a, b from the disk and verify if they contain the query graph
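The path-based filter can be sketched as follows. The three toy graphs a, b, c and the query are invented so that the outcome mirrors the slide (the candidates are a and b); the helper names and the choice of a maximum path length of 2 edges are assumptions.

```python
from collections import defaultdict

def make_graph(labels, edges):
    """Hypothetical helper: returns (node labels, adjacency sets)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return labels, adj

def label_paths(labels, adj, max_len):
    """Canonical label sequences of all simple paths with up to max_len edges."""
    found = set()

    def extend(path):
        seq = tuple(labels[v] for v in path)
        found.add(min(seq, seq[::-1]))  # a path and its reverse are the same feature
        if len(path) - 1 < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for v in labels:
        extend([v])
    return found

database = {
    "a": make_graph({1: "C", 2: "C", 3: "N", 4: "C"}, [(1, 2), (2, 3), (3, 4)]),
    "b": make_graph({1: "C", 2: "C", 3: "N", 4: "C"}, [(1, 2), (2, 3), (3, 4), (4, 1)]),
    "c": make_graph({1: "N", 2: "C", 3: "C"}, [(1, 2), (2, 3)]),
}
query = make_graph({1: "C", 2: "C", 3: "N", 4: "C"}, [(1, 2), (2, 3), (3, 4)])

# Index construction: inverted index between paths and graphs.
index = defaultdict(set)
for name, (labels, adj) in database.items():
    for path in label_paths(labels, adj, 2):
        index[path].add(name)

# Query processing: intersect the graph sets of all query paths.
candidates = set(database)
for path in label_paths(*query, 2):
    candidates &= index.get(path, set())
```

Graph c is pruned because it contains no C-N-C path; a and b still have to be verified with a subgraph isomorphism test.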
Types of features
Different approaches use different types of features:
1. Paths (GraphGrep)
Easy to compute and to manipulate
Generate many false positive candidates
2. Trees (Tree+Δ, GCoding, GString)
Easier to manipulate than subgraphs, and more efficient
Generate more false positives than subgraphs
3. Subgraphs (gIndex, C-Tree, GDIndex, FG-Index, TurboISO)
Generate fewer candidates
Complex structures, generate bigger indexes
gIndex: a subgraph-based approach
§ Identify frequent structures in the database; the frequent structures are subgraphs that appear often in the graph database
§ Prune redundant frequent structures to maintain a small set of discriminative structures
§ Create an inverted index between discriminative frequent structures and graphs in the database
Yan, X., Yu, P.S. and Han, J. Graph indexing: a frequent structure-based approach. SIGMOD, 2004.
Discriminative features
§ All graphs contain the structures C, C-C, C-C-C
§ Why bother indexing these redundant frequent structures?
• Store only those structures that provide more information than existing structures
Good news
Discriminative features are two orders of magnitude fewer than normal subgraph features, and one order of magnitude fewer than frequent subgraphs!
What is discriminative?
§ Mine not only frequent structures but also useful ones
• Given a set of features $f_1, f_2, \dots, f_n$ and a new structure $x$, measure the probability of reconstructing $x$ having already indexed $f_1, f_2, \dots, f_n$:
$P(x \mid f_1, f_2, \dots, f_n), \quad f_i \sqsubseteq x$
What is the advantage of indexing x? When $P$ is small enough, $x$ is a discriminative feature and should be included in the index
$\gamma_x = \frac{1}{P(x \mid f_1, f_2, \dots, f_n)} = \frac{|\bigcap_i D_{f_i}|}{|D_x|}$
$\gamma_x$ is called the discriminative ratio of $x$.
A feature $x$ is discriminative if $\gamma_x \ge \gamma_{min}$
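As a small worked example of the discriminative ratio, assuming a toy inverted index over 8 graphs and a hypothetical threshold of 2.0 for the minimum ratio (neither comes from the lecture):

```python
def discriminative_ratio(index, subfeatures, x):
    """gamma_x = |intersection of D_f over indexed f contained in x| / |D_x|."""
    common = set.intersection(*(index[f] for f in subfeatures))
    return len(common) / len(index[x])

# Toy supporting sets D_f over a database of 8 graphs (made-up data).
index = {
    "C-C": set(range(8)),
    "C-N": set(range(6)),
    "C-C-C": set(range(8)),  # as frequent as its substructure C-C: redundant
    "C-C-N": {0, 1},         # much rarer than C-C and C-N together
}
GAMMA_MIN = 2.0  # hypothetical threshold

ratio_ccc = discriminative_ratio(index, ["C-C"], "C-C-C")
ratio_ccn = discriminative_ratio(index, ["C-C", "C-N"], "C-C-N")
selected = [x for x, r in [("C-C-C", ratio_ccc), ("C-C-N", ratio_ccn)] if r >= GAMMA_MIN]
```

Here C-C-C adds nothing beyond C-C (ratio 1), while C-C-N shrinks the candidate set by a factor of 3 and is kept.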
Why Frequent Structures?
§ We cannot index (or even search) all substructures
§ Large structures will likely be indexed well by their substructures
§ Size-increasing support threshold
§ Use a monotonically nondecreasing function $\psi$ that, given a feature size, returns the support threshold.
• A feature $f$ is frequent if and only if $|D_f| \ge \psi(size(f))$
[Figure: support (y-axis) versus size (x-axis) for an exponential threshold function, with the minimum support threshold Θ marked]
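A size-increasing support threshold can be sketched like this. The exponential shape follows the figure label, but the concrete function, the value of Θ and the maximum feature size are assumptions made for the example.

```python
THETA = 100     # hypothetical minimum support threshold
MAX_SIZE = 10   # hypothetical largest feature size considered

def psi(size):
    """Monotonically nondecreasing threshold: 1 at size 0, THETA from MAX_SIZE on."""
    if size >= MAX_SIZE:
        return THETA
    return THETA ** (size / MAX_SIZE)

def is_frequent(support, size):
    """A feature f is frequent iff |D_f| >= psi(size(f))."""
    return support >= psi(size)

thresholds = [psi(s) for s in range(12)]
small_ok = is_frequent(15, 5)    # threshold at size 5 is about 10
large_ok = is_frequent(15, 10)   # threshold at size 10 is THETA
```

Small features pass with low support, so they are always available for pruning, while large features must be truly frequent to enter the index.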
Reducing the number of intersections
§ How to reduce the number of times a feature $x$ is checked in $C_Q \cap D_x$?
§ Remark: if a feature $f_x \sqsubset f_y$ then $D_{f_y} \subseteq D_{f_x}$.
§ Therefore $C_Q \cap D_{f_y} \cap D_{f_x} = C_Q \cap D_{f_y}$ and we can skip the comparison with $f_x$
§ Solution: check features in a top-down fashion, enumerating subgraphs of the query from the largest to the smallest!
Types of features
Different approaches use different types of features:
1. Paths
Easy to compute and to manipulate
Generate many false positive candidates
2. Trees
Easier to manipulate than subgraphs, and more efficient
Generate more false positives than subgraphs
3. Subgraphs
Generate fewer candidates
Complex structures, generate bigger indexes
Why trees?
§ Tree features are easier to compare
• Subgraph isomorphism can be polynomial on ordered trees
§ Tree features are more expressive than paths
• Paths generate more candidates than trees since they are less restrictive
§ Most of the discovered frequent patterns are trees!!!
• Frequent tree-features and graph-features share similar distributions, and frequent tree-features have pruning power similar to that of graph-features
§ Tree mining can be done much more efficiently than graph mining on G
Size of the feature vs number of features for each feature type
$|F_T'| = |F_T| - |F_P|$
$|F_G'| = |F_G| - |F_T|$
[Figure: number of features versus feature size for each feature type. Annotations: tree and graph features have almost the same distribution; after some point you discover no new paths; most of the features are trees]
Intuitions
• All subtrees of a frequent graph are frequent
• There is little chance that the subtrees of a frequent graph g coincide with those of a frequent graph g', due to structural diversity and label variety
The feature selection cost: Cfs
• The cost, given a graph database G and a minimum support threshold min_supp, of discovering the frequent feature set F (FP/FT/FG) from G

                   Path        Tree                     Graph
Isomorphism        O(n)        O(n)                     P or NPC (?)
Sub-isomorphism    O(n + m)    O(m^{3/2} n / log m)     NP-complete

§ Tree
• A good compromise between
⁃ the more expressive, but computationally harder, general graph
⁃ the faster but less expressive path
• A specialization of the general graph that avoids the undesirable theoretical properties and algorithmic complexity incurred by graphs
Tree + Δ
§ Tree-based graph indexing
§ Δ: select a small number of discriminative graph features on demand, without conducting costly graph mining beforehand
§ Orders of magnitude smaller in index size, but performs much better than existing approaches in index construction and query processing
Zhao, P., Yu, J.X. and Yu, P.S. Graph indexing: tree + delta >= graph. PVLDB, 2007.
Candidate answer set size: $|C_Q|$
§ The pruning power $power(f)$ of a frequent feature $f$ is
$power(f) = \frac{|D| - |D_f|}{|D|}$
§ The pruning power of a frequent feature set $S = \{f_1, f_2, \dots, f_n\}$ is
$power(S) = \frac{|D| - |\bigcap_{i=1}^{n} D_{f_i}|}{|D|}$
§ Theorem 1: Given a frequent graph-feature $g$, let its frequent sub-tree set be $T(g) = \{t_1, \dots, t_n\}$. Then $power(g) \ge power(T(g))$
§ Theorem 2: Given a frequent tree-feature $t$, let its frequent sub-path set be $P(t) = \{p_1, \dots, p_m\}$. Then $power(t) \ge power(P(t))$
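The pruning power definitions can be checked on a small made-up example. The supporting sets below are assumptions, chosen so that every graph containing g also contains its subtrees t1 and t2, as the containment relation requires.

```python
def power(db_size, supporting_sets):
    """(|D| - |intersection of the D_f|) / |D| for one feature or a feature set."""
    common = set.intersection(*supporting_sets)
    return (db_size - len(common)) / db_size

DB_SIZE = 10             # hypothetical database with graphs 0..9
D_g = {0, 1}             # graphs containing the graph-feature g
D_t1 = {0, 1, 2, 3}      # graphs containing subtree t1 of g
D_t2 = {0, 1, 2, 4}      # graphs containing subtree t2 of g

power_g = power(DB_SIZE, [D_g])
power_tg = power(DB_SIZE, [D_t1, D_t2])
```

In this example g prunes 8 of 10 graphs and its subtrees jointly prune 7 of 10, which agrees with Theorem 1.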
Pruning Power
• The pruning power of all frequent subtree features, $T(g)$, of a frequent graph-feature $g$ can be similar to the pruning power of $g$
• There is a big gap between the pruning power of a graph-feature $g$ and that of all its frequent sub-path features, $P(g)$
Discriminative Graph Features Δ
• Consider a query graph $q$ which contains a subgraph $g$
• If $power(T(g)) \approx power(g)$, there is no need to index the graph-feature $g$, because its subtrees jointly have similar pruning power
• If $power(g) \gg power(T(g))$, it is necessary to select $g$ as an index feature, because $g$ is more discriminative than $T(g)$ in terms of pruning
• Discriminative graph-features (w.r.t. their subtree-features, controlled by $\epsilon_0$) are selected from queries on demand, without mining the whole set of frequent graph-features from $D$ beforehand
• Discriminative graph-features are used as additional indexing features, denoted Δ, which can also be reused to answer subsequent queries
Discriminative Graph Selection
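§ The occurrence probability of $g$ in the graph database $D$ is
$P(g) = \frac{|D_g|}{|D|} = \sigma_g$
§ The conditional occurrence probability of $g' \sqsupseteq g$ models the probability of selecting $g'$ from $D$ in the presence of $g$
$P(g' \mid g) = \frac{P(g \wedge g')}{P(g)} = \frac{P(g')}{P(g)} = \frac{|D_{g'}|}{|D_g|}$
§ The upper (similarly, the lower) bound of $P(g' \mid g)$ is expressed only in terms of $T(g')$!!!
$P(g' \mid g) = \frac{|D_{g'}|}{|D_g|} \le \frac{|D| - \frac{|D| - |D_{T(g')}|}{1 - \epsilon_0}}{\sigma_g |D|} = \frac{\sigma_{T(g')} - \epsilon_0}{(1 - \epsilon_0)\,\sigma_g}$
§ The conditional occurrence probability $P(g' \mid g)$ is solely upper-bounded by $T(g')$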
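To see that the two forms of the upper bound agree, the sketch below evaluates both on invented numbers. The database size, the supports and the value of ε0 are assumptions; the function names are hypothetical.

```python
def upper_bound(db_size, support_g, support_tree_gprime, eps0):
    """Simplified form: (sigma_T(g') - eps0) / ((1 - eps0) * sigma_g)."""
    sigma_g = support_g / db_size
    sigma_t = support_tree_gprime / db_size
    return (sigma_t - eps0) / ((1 - eps0) * sigma_g)

def upper_bound_unsimplified(db_size, support_g, support_tree_gprime, eps0):
    """Same bound before dividing numerator and denominator by |D|."""
    sigma_g = support_g / db_size
    numerator = db_size - (db_size - support_tree_gprime) / (1 - eps0)
    return numerator / (sigma_g * db_size)

# Made-up numbers: |D| = 1000, |D_g| = 400, |D_T(g')| = 300, eps0 = 0.1.
bound = upper_bound(1000, 400, 300, 0.1)
bound_check = upper_bound_unsimplified(1000, 400, 300, 0.1)
```

With these numbers both expressions give about 0.556, and only the supports of g and of the subtrees of g' are needed, not the support of g' itself.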
Generic indexes
§ Lindex is an index that works on any kind of feature (paths, graphs, stars, trees)
§ Organizes index features in a lattice (directed acyclic graph)
§ Uses set theory and graph theory to prune the graphs quickly
Yuan, D. and Mitra, P. Lindex: a lattice-based index for graph databases. VLDBJ, 2013.
Lecture road
Graph Indexing
Feature-based Graph Index
Structure Similarity Search
Structure Similarity Search
• CHEMICAL COMPOUNDS
(a) caffeine
(b) diurobromine
(c) viagra
• QUERY GRAPH
Problem
No match, or very few matches, for a query graph
Some “Straightforward” Methods
§ Method 1: Directly compute the similarity between the graphs in the DB and the query graph
• Slow: sequential scan
• Slow: subgraph similarity computation
§ Method 2: Form a set of subgraph queries from the original query graph and use the exact subgraph search
• Exponential: if we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs
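The figure of 1,140 is the number of ways to choose which 3 of the 20 edges are dropped. A quick check, assuming edges are simply named e1 to e20 and ignoring whether the relaxed queries stay connected or coincide:

```python
from itertools import combinations
from math import comb

query_edges = [f"e{i}" for i in range(1, 21)]  # a 20-edge query graph
missed = 3

# One relaxed subgraph query per choice of 3 missed edges.
relaxed_queries = [
    set(query_edges) - set(drop) for drop in combinations(query_edges, missed)
]
count = len(relaxed_queries)
```

`count` equals `comb(20, 3)`, and the number grows quickly as more edges are allowed to be missed.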
Index: Precise vs. Approximate Search
§ Precise Search
• Use frequent patterns as indexing features
• Select features in the database space based on their selectivity
• Build the index
§ Approximate Search
• Hard to build indices covering similar subgraphs → explosive number of subgraphs in databases
Idea:
1. Use a feature-based index structure
2. Select features in the query space (i.e., select features such that they miss 1, 2, ... edges)
Substructure Similarity Measure
§ Query relaxation measure:
• Number of edges that can be relabeled or missed
• Problem: the position of these edges is not fixed
[Figure: query graph (Q)]
Substructure Similarity Measure
§ Feature-based similarity measure
• Each graph $G_i$ is represented as a feature vector $F_i = \{f_1, \dots, f_n\}$
• The similarity is defined by the component-wise distance between this vector and the query vector (using the frequencies of each feature)
• Advantages
⁃ Easy to index
⁃ Fast
⁃ Rough measure
Shasha, D., Wang, J.T. and Giugno, R. Algorithmics and applications of tree and graph searching. PODS, 2002.
Intuition: Feature-Based Similarity Search
If graph G contains the major part of a query graph Q, G should share a number of common features with Q
[Figure: a query (Q), two graphs (G1) and (G2), and a substructure]
Problem
Given a relaxation ratio (i.e., number of missing edges), calculate the maximum number of features that can be missed. At least one feature should be contained
Feature-Graph Matrix
Rows: features. Columns: graphs in database.

       G1   G2   G3   G4   G5
f1      0    1    0    1    1
f2      0    1    0    0    1
f3      1    0    1    1    1
f4      1    0    0    0    1
f5      0    0    1    1    0

Assume a query graph has 5 features and at most 2 features can be missed due to some relaxation threshold (computed analytically from the number of missing edges)
Invalid graphs (more than 2 features missing): G1, G2, G3
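The filtering rule of this slide can be sketched directly on the feature-graph matrix. The matrix values are the ones shown on the slide (rows f1 to f5, columns G1 to G5); the function name is hypothetical.

```python
GRAPHS = ["G1", "G2", "G3", "G4", "G5"]
# Rows are features f1..f5, columns are graphs G1..G5 (1 = graph contains feature).
MATRIX = [
    [0, 1, 0, 1, 1],  # f1
    [0, 1, 0, 0, 1],  # f2
    [1, 0, 1, 1, 1],  # f3
    [1, 0, 0, 0, 1],  # f4
    [0, 0, 1, 1, 0],  # f5
]

def filter_graphs(matrix, graphs, max_missed):
    """Keep the graphs that miss at most max_missed of the query features."""
    kept = []
    for col, name in enumerate(graphs):
        hits = sum(row[col] for row in matrix)
        if len(matrix) - hits <= max_missed:
            kept.append(name)
    return kept

candidates = filter_graphs(MATRIX, GRAPHS, max_missed=2)
```

G1, G2 and G3 each miss three of the five features and are discarded; G4 and G5 remain as candidates for verification.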
Edge Relaxation – Feature Misses
§ If we allow k edges to be relaxed, find J such that J is the maximum number of features that can be hit by k edges; this is the maximum coverage problem (or set k-cover problem)
§ NP-complete
§ A greedy algorithm exists with an approximation guarantee
$J_{greedy} \ge \left[1 - \left(1 - \frac{1}{k}\right)^{k}\right] \cdot J$
• We design a heuristic to refine the bound of feature misses
Edge-feature matrix
The user selected e1, e2, e3 as relaxable edges
[Figure: the query with edges e1, e2, e3 and the features f1, f2, f3]
Rows: edges. Columns: embeddings (matches of the pattern in the query).

       f1   f2(1)   f2(2)   f3(1)   f3(2)   f3(3)   f3(4)
e1      0     1       1       1       0       0       0
e2      1     1       0       0       1       0       1
e3      1     0       1       0       0       1       1
Finding the maximum number of features
that can be missed
§ Using the edge-feature matrix, consider all the features that are hit at least k times, where k is the maximum number of relaxable edges
§ Use a greedy strategy to approximate the value:
1. Select the row that hits the largest number of columns
2. Remove the row and the columns that have a 1 in this row
3. Repeat 1-2 until k rows have been removed.
4. Return the number of columns removed.
§ Improve the solution by searching for the most selective features (those that include more edges)
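The greedy steps 1-4 can be sketched as follows, using the edge-feature matrix from the previous slide written as one set of hit columns per edge. The function name is hypothetical, and ties between rows are broken by dictionary order, which is an arbitrary choice.

```python
# For each relaxable edge, the embeddings (columns) that have a 1 in its row.
EDGE_FEATURE = {
    "e1": {"f2(1)", "f2(2)", "f3(1)"},
    "e2": {"f1", "f2(1)", "f3(2)", "f3(4)"},
    "e3": {"f1", "f2(2)", "f3(3)", "f3(4)"},
}

def greedy_feature_misses(edge_feature, k):
    """Greedy maximum coverage: number of columns hit by k greedily chosen rows."""
    remaining = {edge: set(cols) for edge, cols in edge_feature.items()}
    removed = set()
    for _ in range(min(k, len(remaining))):
        # Step 1: the row that hits the most columns not removed so far.
        best = max(remaining, key=lambda e: len(remaining[e] - removed))
        # Step 2: remove that row and the columns it hits.
        removed |= remaining.pop(best)
    # Step 4: the number of columns removed.
    return len(removed)

misses = {k: greedy_feature_misses(EDGE_FEATURE, k) for k in (1, 2, 3)}
```

On this matrix, relaxing one edge can hit 4 embeddings, two edges 6, and all three edges 7. Here any pair of edges also covers exactly 6 columns, so the greedy value for k = 2 coincides with the optimum.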
Query Processing Framework
Step 1. Index Construction
• Select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database
Step 2. Feature Miss Estimation
• Determine the indexed features belonging to the query graph
• Calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J
⁃ On the query graph, not the graph database
Step 3. Query Processing
• Use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, FG – FQ
• If FG – FQ > J, discard G. The remaining graphs constitute a candidate answer set
In the next episode …
Node classification
Graph summarization
And much more…
Questions?
References
§ Petrakis, E.G.M. and Faloutsos, A., 1997. Similarity searching in medical image databases. IEEE Transactions on Knowledge and Data Engineering, 9(3), pp. 435-447.
§ Shokoufandeh, A., Dickinson, S.J., Siddiqi, K. and Zucker, S.W., 1999. Indexing using a spectral encoding of topological structure. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on (Vol. 2). IEEE.
§ Berretti, S., Del Bimbo, A. and Vicario, E., 2001. Efficient matching and indexing of graph models in content-based retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(10), pp. 1089-1105.
§ Holder, L.B., Cook, D.J. and Djoko, S., 1994, July. Substructure Discovery in the SUBDUE System. In KDD workshop (pp. 169-180).
§ Yan, X., Yu, P.S. and Han, J., 2004, June. Graph indexing: a frequent structure-based approach. SIGMOD.
§ Shasha, D., Wang, J.T. and Giugno, R. Algorithmics and applications of tree and graph searching. PODS, 2002.
§ James, C.A., Weininger, D. and Delany, J., 1995. Daylight theory manual. Daylight Chemical Information Systems, 3951.
§ Yuan, D. and Mitra, P. Lindex: a lattice-based index for graph databases. VLDBJ, 2013.
§ Zhao, P., Yu, J.X. and Yu, P.S. Graph indexing: tree + delta >= graph. PVLDB, 2007.