Efficient content structures and queries in CRX

APACHE SLING & FRIENDS TECH MEETUP
BERLIN, 26-28 SEPTEMBER 2012
Marcel Reutegger
Agenda
• Jackrabbit & CRX basics
• Efficient content structures and limitations of current implementation
• Query performance analysis and optimization
adaptTo() 2012
Jackrabbit basics
• Nodes & properties stored in one entity -> bundle
• Every node/bundle has a UUID (random)
• Child nodes are linked from the parent node
• Binaries go into the DataStore
Jackrabbit basics
• Bundle structure
[Diagram: a bundle contains its own UUID, the parent UUID, a list of child node references (name / UUID pairs), and its properties (name / value pairs).]
Jackrabbit basics
• Binaries go into the DataStore
  - Size threshold >= 4kB, otherwise inlined in bundle
  - Content addressable storage, hash of content identifies binary
• DataStore garbage collection
  - Cost to run is linear in the number of nodes in the repository
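The two DataStore rules above (size threshold and content addressing) can be illustrated with a small in-memory sketch. This is not Jackrabbit code: `ToyDataStore` and `store_binary` are hypothetical names, and the choice of SHA-1 as the hash is an assumption made for the example.

```python
import hashlib

INLINE_THRESHOLD = 4 * 1024  # 4 kB threshold taken from the slide


class ToyDataStore:
    """Hypothetical in-memory stand-in for a content-addressable store."""

    def __init__(self):
        self.records = {}  # content hash -> binary content

    def put(self, data: bytes) -> str:
        # Assumption: SHA-1 of the content is used as the record identifier.
        key = hashlib.sha1(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self.records.setdefault(key, data)
        return key


def store_binary(store: ToyDataStore, data: bytes):
    """Inline small values; put values of 4 kB or more into the store."""
    if len(data) < INLINE_THRESHOLD:
        return ("inline", data)
    return ("datastore", store.put(data))
```

Storing the same large binary twice returns the same key and keeps a single record, which is why a separate garbage collection pass is needed to find records that no node references any more.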
TarPM basics
• Nodes & Properties (bundles) stored in tar files
• Tar files are append only
  - Data is never overwritten
  - Garbage is removed by TarPM optimization (scheduled, incremental)
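The append-only idea can be sketched with a toy log. This is only an illustration of the principle: `AppendOnlyLog` is a hypothetical class, and neither the data layout nor the `optimize` step reflects the actual TarPM file format or its optimization algorithm.

```python
class AppendOnlyLog:
    """Hypothetical append-only store: writes never overwrite old entries."""

    def __init__(self):
        self.entries = []  # (uuid, payload) tuples, only ever appended
        self.index = {}    # uuid -> position of the latest version

    def write(self, uuid, payload):
        # A new version is appended; the old one stays behind as garbage.
        self.entries.append((uuid, payload))
        self.index[uuid] = len(self.entries) - 1

    def read(self, uuid):
        return self.entries[self.index[uuid]][1]

    def garbage_count(self):
        # Entries that are no longer referenced by the index.
        return len(self.entries) - len(self.index)

    def optimize(self):
        """Copy only the live entries into a fresh log."""
        compacted = AppendOnlyLog()
        for position in sorted(self.index.values()):
            compacted.write(*self.entries[position])
        return compacted
```

Updating a bundle twice leaves one dead entry in the log until `optimize()` produces a compacted copy.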
TarPM basics
• TarPM index
[Diagram: index_1_0.tar (data.txt, sorted) lists UUIDs; each UUID points to a bundle stored in one of the data tar files, data_0000.tar or data_0001.tar.]
Efficient content structures and limitations of current implementation
Number of nodes
Number of nodes
• Keep number of nodes low
  - Performance degrades with increasing number of nodes
  - Random UUIDs cause random I/O -> Jackrabbit design
  - 15k rpm drive: 200-400 IOPS
  - What about locality in data tar files?
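A back-of-the-envelope calculation shows why the IOPS figure matters. The sketch assumes the worst case, which is a simplification: every bundle read costs one random I/O and nothing is served from a cache. The function name and the bundle count of 10,000 are chosen only for illustration.

```python
def seconds_for_random_reads(bundle_count, iops):
    """Worst-case time if every bundle read needs one random disk I/O."""
    return bundle_count / iops


# Reading 10,000 uncached bundles on a 15k rpm drive (200-400 IOPS).
slowest = seconds_for_random_reads(10_000, 200)
fastest = seconds_for_random_reads(10_000, 400)
print(f"{fastest:.0f} to {slowest:.0f} seconds")
```

Under these assumptions, traversing ten thousand nodes takes between 25 and 50 seconds of pure seek time.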
Number of nodes
[Diagram, built up over five slides: reading child nodes. Numbered steps 1 to 4 mark successive accesses to the sorted index (index_1_0.tar, data.txt) and to bundles stored in data_0000.tar and data_0001.tar.]
Number of nodes
• What about OS buffer cache?
  - Cache is filled on demand
  - Only helps to some degree
• Tar index file sizes (64 bytes per bundle)
  - 1 million nodes: 70 MB
  - 10 million nodes: 700 MB
  - 100 million nodes: 7 GB
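The order of magnitude of these index sizes follows from the 64 bytes per bundle. A minimal sketch of the arithmetic, assuming exactly one 64-byte index entry per node and decimal megabytes (both are assumptions of this example):

```python
BYTES_PER_ENTRY = 64  # index entry size per bundle, from the slide


def raw_index_megabytes(node_count):
    """Raw index size in decimal MB, assuming one entry per node."""
    return node_count * BYTES_PER_ENTRY / 1_000_000


for nodes in (1_000_000, 10_000_000, 100_000_000):
    print(nodes, raw_index_megabytes(nodes), "MB")
```

The raw products (64 MB, 640 MB, 6.4 GB) come out slightly below the figures quoted on the slide; the sketch does not try to account for that difference.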
Number of nodes
• How to reduce number of nodes
  - Use version purge tool
  - Remove archived workflow instances
  - Purge audit events
  - Application specific
    - Bad: document view 'import' of XML
    - Good: Pack properties on few nodes
• Other benefits: DataStore GC will be faster
Number of nodes
• Other options:
  - Solid state drive (~100k IOPS)
  - Force OS to cache TarPM index files
Efficient content structures and limitations of current implementation
Number of child nodes
Number of child nodes
• Frequently asked questions:
  - «What is the maximum supported number of child nodes?»
  - «I have X number of child nodes. Will performance be OK?»
It depends!
Number of child nodes
• Maximum number of child nodes
[Diagram: the bundle structure again. The child node references (name / UUID pairs) are stored inside the parent's bundle, next to its UUID, parent UUID and properties.]
Heap is the limit
Number of child nodes
• Adding a single child node
Number of child nodes
• Large number of child nodes
  - OK for:
    - Static content
    - /libs/wcm/core/i18n/de has ~4k child nodes
  - Not OK for:
    - Dynamic content
    - authentication pins, replication items, user generated content
Number of child nodes – Recommendations
• Structure content
  - E.g. date/time based: 2012/09/26
  - Use utilities like Jackrabbit BTreeManager
• Make sure application keeps number of child nodes within limits (e.g. 1000)
• Save in batches when possible
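Two of these recommendations are easy to sketch: a date-based path that spreads items over year/month/day folders, and a helper that splits a list of items into batches so that each batch can be saved on its own. The function names, the `/content/events` root and the batch size are hypothetical and only serve the example.

```python
from datetime import date


def date_path(root, day, name):
    """Build a path such as <root>/2012/09/26/<name> for the given day."""
    return f"{root}/{day.strftime('%Y/%m/%d')}/{name}"


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


path = date_path("/content/events", date(2012, 9, 26), "item1")
chunks = list(batches(["n1", "n2", "n3", "n4", "n5"], 2))
```

With a date-based layout, no single parent collects all items, and saving per batch keeps each change set small.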
Number of child nodes
• What about performance?
  - Usually repository growth is the major concern, but...
  - Unfortunate combination of application and content design may result in bad performance
Efficient content structures and limitations of current implementation
David's Model
David's Model: A guide for content modeling
• Rule #1: Data First, Structure Later. Maybe.
• Rule #2: Drive the content hierarchy, don't let it happen.
• Rule #3: Workspaces are for clone(), merge() and update().
• Rule #4: Beware of Same Name Siblings.
• Rule #5: References considered harmful.
• Rule #6: Files are Files are Files.
• Rule #7: ID's are evil.
• http://wiki.apache.org/jackrabbit/DavidsModel
A guide for content modeling - Appendix
• Avoid features not used in CQ
  - XA transactions
  - Shareable nodes
  - Lifecycle Management
  - Retention and Hold
Query performance analysis and optimization
Query analysis
Query performance analysis and optimization
• Query debug log
  - http://dev.day.com/kb/home/Crx/Troubleshooting/HowToDebugJCRQueries.html
  - "executed in <time> ms. (<query>)"
• JMX (CQ 5.5)
  - QueryStat: slow and most frequent queries
  - TimeSeries: count, duration, average
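Once the query debug log is enabled, slow queries can be extracted from it with a small script. The sketch below relies only on the message format quoted above ("executed in <time> ms. (<query>)"); the `DEBUG` prefix in the sample lines and the function name are made up for the example, and a real log line will carry additional fields in front.

```python
import re

# Matches the message format quoted on the slide.
PATTERN = re.compile(r"executed in (\d+) ms\. \((.*)\)\s*$")


def slow_queries(lines, threshold_ms):
    """Return (duration_ms, query) pairs at or above the threshold, slowest first."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match and int(match.group(1)) >= threshold_ms:
            hits.append((int(match.group(1)), match.group(2)))
    return sorted(hits, reverse=True)


sample = [
    "DEBUG executed in 12 ms. (//element(*, nt:file))",
    "DEBUG executed in 950 ms. (//*[jcr:contains(., '*rabbit')])",
    "some unrelated log line",
]
slow = slow_queries(sample, 100)
```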
Query performance analysis and optimization
• Fast: simple comparison
  - sling:resourceType = 'my/type'
• Fast: node type match
  - //element(*, nt:hierarchyNode)
• Fast: simple fulltext search
  - jcr:contains(@jcr:title, 'crx')
• Fast: like on few distinct values
  - jcr:like(@jcr:mimeType, '%/plain')
Query performance analysis and optimization
• Slower: path constraints
  - content/geometrixx/en//*[ ... ]
  - Alternative: turn path into property constraint. E.g. keep language property on every page and write: //*[@language = 'en']
• Slower: relative path in predicate
  - //element(*, cq:Page)[jcr:contains(jcr:content, 'crx')]
  - Alternative: shorten path in predicate and post process result: //element(*, cq:PageContent)[jcr:contains(., 'crx')]
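The post-processing step for the second alternative can be as simple as mapping each matching content node back to its page. A minimal sketch, assuming the query returns the paths of the matching `jcr:content` nodes; the function name and the sample paths are hypothetical.

```python
def page_paths(content_paths):
    """Map paths of jcr:content nodes to the paths of their parent pages."""
    suffix = "/jcr:content"
    pages = []
    for path in content_paths:
        if path.endswith(suffix):
            page = path[:-len(suffix)]
            if page not in pages:  # keep result order, drop duplicates
                pages.append(page)
    return pages


hits = [
    "/content/geometrixx/en/products/jcr:content",
    "/content/geometrixx/en/company/jcr:content",
]
pages = page_paths(hits)
```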
Query performance analysis and optimization
• Slower: jcr:contains with wildcards
  - jcr:contains(., 'sing*')
  - Alternative: Implement Lucene analyzer with appropriate stemmer
Query performance analysis and optimization
• Slow: jcr:contains with initial wildcard
  - jcr:contains(., '*rabbit')
  - Alternative: don't do it, unless you know exactly what you are doing!
• Slow: jcr:like on many distinct values
  - jcr:like(@email, '%@gmail.com')
  - Alternative: store data you want to query in separate property, then you can write: @email-host = 'gmail.com'
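The separate property has to be written by the application whenever the e-mail address is stored. A sketch of that derivation, with node properties modelled as a plain dictionary; the helper names are hypothetical, and lower-casing the host is a choice made for this example.

```python
def email_host(email):
    """Return the host part of an e-mail address, lower-cased."""
    local, separator, host = email.rpartition("@")
    if not separator or not local or not host:
        raise ValueError(f"not an e-mail address: {email!r}")
    return host.lower()


def with_email_host(properties):
    """Return a copy of the properties with a derived 'email-host' entry."""
    result = dict(properties)
    result["email-host"] = email_host(properties["email"])
    return result


user = with_email_host({"email": "jane.doe@Gmail.com"})
```

The query then becomes an exact match on a property with few distinct values instead of a `jcr:like` over every distinct address.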
Query performance analysis and optimization
• Slow: ranges matching many distinct values
  - @jcr:lastModified > xs:dateTime('2001-09-17T18:17:13.000+02:00')
  - Alternative: reduce resolution (e.g. only store date and not time)
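Reducing the resolution shrinks the number of distinct values a range query has to match. The sketch below truncates ISO 8601 timestamps to the date; the function name and the sample timestamps are made up, and the date is taken as written in the timestamp, without any time zone normalization.

```python
from datetime import datetime


def date_only(timestamp):
    """Truncate an ISO 8601 timestamp to its date part (YYYY-MM-DD)."""
    return datetime.fromisoformat(timestamp).date().isoformat()


timestamps = [
    "2001-09-17T18:17:13.000+02:00",
    "2001-09-17T18:17:14.250+02:00",
    "2001-09-17T09:00:00.000+02:00",
]
distinct_full = len(set(timestamps))
distinct_dates = len({date_only(t) for t in timestamps})
```

Three distinct timestamps collapse into a single date value, which would be stored in its own property and used in the range constraint.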
Query performance analysis and optimization
• Query result does lazy loading of nodes
  - Query.execute() may return quickly even if result size is big
  - Looping over QueryResult.getNodes()/getRows() will load data from TarPM
• Reading a large result set completely is always slow
  - Time to get result: query execution time + node retrieval time
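The lazy loading behaviour can be mimicked with a generator. This is a toy model, not the JCR API: `LazyResult` and `load_node` are hypothetical, and the `loads` list merely records when a node is actually fetched.

```python
class LazyResult:
    """Toy query result: knows the matching ids, loads nodes on demand."""

    def __init__(self, matching_ids, load_node):
        self._ids = matching_ids
        self._load = load_node

    def nodes(self):
        # Each node is loaded only when the caller advances the iterator.
        for node_id in self._ids:
            yield self._load(node_id)


loads = []  # records every simulated read from the persistence layer


def load_node(node_id):
    loads.append(node_id)
    return {"id": node_id}


result = LazyResult(["a", "b", "c"], load_node)  # "executing" loads nothing
iterator = result.nodes()
first = next(iterator)  # only now is the first node loaded
```

Creating the result is cheap; the cost appears while iterating, one node retrieval per step.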
Query performance analysis and optimization
• Recommendations
  - Test with real content
  - Structure content to avoid queries
  - Denormalize
  - Avoid path constraints
  - Replace frequent queries with initial query + event listener
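The last recommendation can be sketched as a cached result set that is filled once and then kept current by change events. `CachedQuery` and its callbacks are hypothetical names for this example; in a repository the callbacks would be driven by JCR observation events rather than called by hand.

```python
class CachedQuery:
    """Run the query once, then keep the result current from change events."""

    def __init__(self, predicate, initial_nodes):
        self.predicate = predicate
        # Initial query: evaluate the predicate over all nodes once.
        self.paths = {
            path for path, props in initial_nodes.items() if predicate(props)
        }

    def on_changed(self, path, props):
        # Called for added or modified nodes.
        if self.predicate(props):
            self.paths.add(path)
        else:
            self.paths.discard(path)

    def on_removed(self, path):
        self.paths.discard(path)


content = {
    "/content/a": {"language": "en"},
    "/content/b": {"language": "de"},
}
english = CachedQuery(lambda props: props.get("language") == "en", content)
english.on_changed("/content/c", {"language": "en"})
english.on_removed("/content/a")
```

After the initial evaluation, each lookup is a read of `english.paths` and no further query is executed.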
Efficient content structures and queries in CRX
Thank you