APACHE SLING & FRIENDS TECH MEETUP
BERLIN, 26-28 SEPTEMBER 2012

Efficient content structures and queries in CRX
Marcel Reutegger
adaptTo() 2012

Agenda
- Jackrabbit & CRX basics
- Efficient content structures and limitations of current implementation
- Query performance analysis and optimization

Jackrabbit basics
- Nodes & properties stored in one entity -> bundle
- Every node/bundle has a UUID (random)
- Child nodes are linked from the parent node
- Binaries go into the DataStore

Jackrabbit basics: Bundle structure
[Figure: a bundle holds the bundle UUID, the parent UUID, child node references (Name / UUID pairs) and properties (Name / Value pairs).]

Jackrabbit basics: Binaries go into the DataStore
- Size threshold >= 4kB, otherwise inlined in bundle
- Content addressable storage, hash of content identifies binary
- DataStore garbage collection
  - Cost to run is linear in the number of nodes in the repository

TarPM basics
- Nodes & properties (bundles) stored in tar files
- Tar files are append only
- Data is never overwritten
- Garbage is removed by TarPM optimization (scheduled, incremental)

TarPM basics: TarPM index
[Figure: index_1_0.tar (data.txt, sorted) lists UUIDs; each UUID points to a bundle stored in data_0000.tar or data_0001.tar.]

Efficient content structures and limitations of current implementation: Number of nodes

Number of nodes
- Keep number of nodes low
- Performance degrades with increasing number of nodes
- Random UUIDs cause random I/O -> Jackrabbit design
- 15k rpm drive: 200-400 IOPS
- What about locality in data tar files?
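
To get a rough feel for the 200-400 IOPS figure above, the following sketch estimates how long a cold read of many bundles could take. It assumes, pessimistically, that every bundle read costs exactly one random disk access and that no cache helps; the class name and the node count of one million are made up for illustration and do not come from the talk.

```java
public class ColdReadEstimate {

    /** Seconds needed to read nodeCount bundles if each read costs one random I/O. */
    static double secondsFor(long nodeCount, int iops) {
        return (double) nodeCount / iops;
    }

    public static void main(String[] args) {
        long nodes = 1_000_000L; // hypothetical repository size
        // 200 and 400 are the lower and upper IOPS figures quoted for a 15k rpm drive.
        for (int iops : new int[] {200, 400}) {
            System.out.printf("%d nodes at %d IOPS: %.0f s%n",
                    nodes, iops, secondsFor(nodes, iops));
        }
    }
}
```

Under these assumptions one million uncached bundle reads take between 2500 and 5000 seconds, which is why the number of nodes and the locality of their bundles matter.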
Number of nodes
[Figure, built up over five slides: index_1_0.tar (data.txt, sorted) holds UUID entries that point to bundles in data_0000.tar and data_0001.tar. Numbered markers 1 to 4 are added one per slide as child nodes are looked up, landing on the index and on both data files.]

Number of nodes: What about OS buffer cache?
- Cache is filled on demand
- Only helps to some degree
- Tar index file sizes (64 bytes per bundle)
  - 1 million nodes: 70 MB
  - 10 million nodes: 700 MB
  - 100 million nodes: 7 GB

Number of nodes: How to reduce number of nodes
- Use version purge tool
- Remove archived workflow instances
- Purge audit events
- Application specific
  - Bad: document view 'import' of XML
  - Good: Pack properties on few nodes
- Other benefits: DataStore GC will be faster

Number of nodes: Other options
- Solid state drive (~100k IOPS)
- Force OS to cache TarPM index files

Efficient content structures and limitations of current implementation: Number of child nodes

Number of child nodes
Frequently asked questions:
«What is the maximum supported number of child nodes?»
«I have X number of child nodes.
Will performance be OK?»

It depends!

Number of child nodes: Maximum number of child nodes
[Figure: the bundle structure (bundle UUID, parent UUID, child node references as Name / UUID pairs, properties as Name / Value pairs), annotated "Heap is the limit".]

Number of child nodes: Adding a single child node

Number of child nodes: Large number of child nodes
- OK for: Static content
  - /libs/wcm/core/i18n/de has ~4k child nodes
- Not OK for: Dynamic content
  - authentication pins, replication items, user generated content

Number of child nodes: Recommendations
- Structure content
  - E.g. date/time based: 2012/09/26
  - Use utilities like Jackrabbit BTreeManager
- Make sure application keeps number of child nodes within limits (e.g. 1000)
- Save in batches when possible

Number of child nodes: What about performance?
- Usually repository growth is the major concern, but...
- Unfortunate combination of application and content design may result in bad performance

Efficient content structures and limitations of current implementation: David's Model

David's Model: A guide for content modeling
- Rule #1: Data First, Structure Later. Maybe.
- Rule #2: Drive the content hierarchy, don't let it happen.
- Rule #3: Workspaces are for clone(), merge() and update().
- Rule #4: Beware of Same Name Siblings.
- Rule #5: References considered harmful.
- Rule #6: Files are Files are Files.
- Rule #7: ID's are evil.
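
Rule #2 and the recommendation to structure dynamic content by date (e.g. 2012/09/26) could be applied along the following lines. This is a minimal sketch that only computes the target path; it does not touch the JCR API, and the class name, the parent path /content/usergenerated and the node name are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateBuckets {

    // One folder level each for year, month and day, as in 2012/09/26.
    private static final DateTimeFormatter BUCKET = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    /** Builds parentPath/yyyy/MM/dd/nodeName so that new items are spread over one folder per day. */
    static String pathFor(String parentPath, LocalDate date, String nodeName) {
        return parentPath + "/" + date.format(BUCKET) + "/" + nodeName;
    }

    public static void main(String[] args) {
        // Hypothetical parent path and node name, for illustration only.
        System.out.println(pathFor("/content/usergenerated", LocalDate.of(2012, 9, 26), "comment-1"));
    }
}
```

A per-day folder only bounds the number of child nodes if the daily volume stays within the limit the application aims for (e.g. 1000); busier applications may need an additional level such as the hour.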
David's Model: http://wiki.apache.org/jackrabbit/DavidsModel

A guide for content modeling - Appendix
Avoid features not used in CQ:
- XA transactions
- Shareable nodes
- Lifecycle Management
- Retention and Hold

Query performance analysis and optimization: Query analysis

Query performance analysis and optimization
- Query debug log
  - http://dev.day.com/kb/home/Crx/Troubleshooting/HowToDebugJCRQueries.html
  - "executed in <time> ms. (<query>)"
- JMX (CQ 5.5)
  - QueryStat: slow and most frequent queries
  - TimeSeries: count, duration, average

Query performance analysis and optimization
- Fast: simple comparison
    sling:resourceType = 'my/type'
- Fast: node type match
    //element(*, nt:hierarchyNode)
- Fast: simple fulltext search
    jcr:contains(@jcr:title, 'crx')
- Fast: like on few distinct values
    jcr:like(@jcr:mimeType, '%/plain')
- Slower: path constraints
    content/geometrixx/en//*[ ... ]
  Alternative: turn path into property constraint. E.g. keep language property on every page and write:
    //*[@language = 'en']
- Slower: relative path in predicate
    //element(*, cq:Page)[jcr:contains(jcr:content, 'crx')]
  Alternative: shorten path in predicate and post process result:
    //element(*, cq:PageContent)[jcr:contains(., 'crx')]
- Slower: jcr:contains with wildcards
    jcr:contains(., 'sing*')
  Alternative: Implement Lucene analyzer with appropriate stemmer
- Slow: jcr:contains with initial wildcard
    jcr:contains(., '*rabbit')
  Alternative: don't do it, unless you know exactly what you are doing!
- Slow: jcr:like on many distinct values
    jcr:like(@email, '%@gmail.com')
  Alternative: store data you want to query in separate property, then you can write:
    @email-host = 'gmail.com'
- Slow: ranges matching many distinct values
    @jcr:lastModified > xs:dateTime('2001-09-17T18:17:13.000+02:00')
  Alternative: reduce resolution (e.g. only store date and not time)

Query performance analysis and optimization
- Query result does lazy loading of nodes
  - Query.execute() may return quickly even if result size is big
  - Looping over QueryResult.getNodes()/getRows() will load data from TarPM
- Reading a large result set completely is always slow
- Time to get result: query execution time + node retrieval time

Query performance analysis and optimization: Recommendations
- Test with real content
- Structure content to avoid queries
- Denormalize
- Avoid path constraints
- Replace frequent queries with initial query + event listener

Efficient content structures and queries in CRX
Thank you
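
The two "store what you want to query" alternatives from the slow-query slides (a separate email-host property, and a date stored without time of day) both come down to deriving a coarser value at write time. The sketch below shows one possible way to compute such values in plain Java; the class and method names are hypothetical, and how the values are written to the repository is left out.

```java
import java.time.OffsetDateTime;
import java.util.Locale;

public class DerivedProperties {

    /** Host part of an e-mail address, lower-cased, suitable for a separate email-host property. */
    static String emailHost(String email) {
        int at = email.lastIndexOf('@');
        if (at < 0 || at == email.length() - 1) {
            throw new IllegalArgumentException("not an e-mail address: " + email);
        }
        return email.substring(at + 1).toLowerCase(Locale.ROOT);
    }

    /** Date-only form of a timestamp; time of day and offset are dropped to reduce resolution. */
    static String dateOnly(OffsetDateTime timestamp) {
        return timestamp.toLocalDate().toString();
    }

    public static void main(String[] args) {
        System.out.println(emailHost("jane.doe@gmail.com"));
        System.out.println(dateOnly(OffsetDateTime.parse("2001-09-17T18:17:13.000+02:00")));
    }
}
```

With values like these stored next to the original properties, the query can use a simple comparison such as @email-host = 'gmail.com' instead of jcr:like, and range queries match far fewer distinct values. Whether the reduced date is kept as a string or as a truncated date property is an application decision that the talk does not prescribe.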