Data-driven indexing in GLAST
Marco Frailis

DC1 indexing at GSFC
- Space-driven indexing
  - Use of a level 2 HTM pixelization
  - The sky is divided into 128 regions
  - Each file is identified by: HTM id + min time + max time
- Search performed by 5 parallel processes
  - Each process sorts the resulting photons by time
  - The 5 sets are merged into a FITS file

VAMSplit R-tree
- A data-driven indexing method follows the data distribution to efficiently index multidimensional data
- An optimized and static R-tree structure
- Data is bulk loaded into the tree
- At each recursive step of the construction:
  - the dimension with maximum variance is calculated
  - the split is performed at a near-median element

The R-tree (simplified)
- Inner node format: (cp, Rectangle)
  - cp: the address of a child node in the R-tree
  - Rectangle: d-dimensional minimum bounding box of all the rectangle entries in that child node
- Leaf node format: (Point, Attributes)
  - Point: coordinates in a d-dimensional space
  - Attributes: data associated with that point
- The maximum number of entries in each node is limited by the disk page size

VAMSplit construction (1)
- A near complete and balanced R-tree is built
- At each recursive step, the capacity of the child subtrees is calculated by:
    $ccap = B \cdot F^{\lceil \log_F (N/B) \rceil - 1}$
  where B is the bucket capacity, F the internal fanout and N the number of elements
- The near-median element is selected by: N 1 nm ccap ccap 2

VAMSplit construction (2)
- The near-median element is selected on the dimension with maximum variance
  - a better split strategy than selecting the dimension with maximum spread
- For large datasets, an external selection algorithm is needed
  - pivot value found by a sampling method
  - caching method used for the partition step

Photon data index

Detail on the Crab

Preliminary comparison
- 25 queries performed on one year of simulated data (40.1 million good photons)
- 165 bytes for each photon (instead of 90)
- SCSI vs IDE disk
- Circular vs rectangular search
- DC1 index search time (one process):
  - 30.86 s without sorting
  - 76.56 s including sorting
- VAMSplit R-tree: 141.09 s (rough estimate, measured on different machines)

Future work
- Improve the data structure to help browsing and analyzing the data
- Optimize the code
- Different split strategies
- Other node information: exact number of elements in its subtrees, variance, other moments
- Add other dimensions to the index, i.e. time and energy (not possible with the HTM)
- Possible gamma-ray burst application
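
The VAMSplit construction described in the slides (bulk loading, split on the dimension with maximum variance, split position at a near-median multiple of the child subtree capacity) can be sketched in a few lines of Python. This is only an illustrative in-memory sketch, not the code used for the GLAST index. The function names (`child_capacity`, `max_variance_dim`, `partition`, `build`) and the tuple-based node layout are hypothetical. Three simplifications are assumed: a full sort stands in for the external selection algorithm, leaf entries carry only the point (no attributes), and the split position rounds N / (2 · ccap) to the nearest whole number of child subtrees, which is an assumed rule rather than the exact formula from the slide. A partially filled last group may also produce a shallower subtree, which the sketch does not pad.

```python
from statistics import pvariance


def child_capacity(n, bucket, fanout):
    """ccap = B * F**(ceil(log_F(N / B)) - 1), computed with integers.

    Only meaningful for n > bucket; smaller sets fit in a single leaf.
    """
    cap = bucket
    while cap * fanout < n:  # grow until one more level would hold all n points
        cap *= fanout
    return cap


def max_variance_dim(points):
    """Index of the coordinate with the largest population variance."""
    dims = len(points[0])
    return max(range(dims), key=lambda d: pvariance([p[d] for p in points]))


def mbr(points):
    """Minimum bounding box of a point set as (lower corner, upper corner)."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return lo, hi


def partition(points, ccap):
    """Recursively split points into groups of at most ccap elements."""
    n = len(points)
    if n <= ccap:
        return [points]
    dim = max_variance_dim(points)
    # Assumption: an in-memory sort replaces the external selection step.
    ordered = sorted(points, key=lambda p: p[dim])
    # Assumed rounding rule: nearest multiple of ccap to n / 2, at least ccap.
    nm = ccap * ((n + ccap) // (2 * ccap))
    return partition(ordered[:nm], ccap) + partition(ordered[nm:], ccap)


def build(points, bucket, fanout):
    """Bulk load points into a nested ("leaf" | "inner", entries) structure."""
    if len(points) <= bucket:
        return ("leaf", list(points))
    ccap = child_capacity(len(points), bucket, fanout)
    groups = partition(points, ccap)
    # Inner entries mirror the (cp, Rectangle) format: box plus child node.
    return ("inner", [(mbr(g), build(g, bucket, fanout)) for g in groups])
```

For example, with N = 100 points, bucket capacity B = 4 and fanout F = 3, `child_capacity(100, 4, 3)` gives 36, so the root is split into groups of at most 36 points before each group is built recursively. Because every split position is a multiple of ccap, all groups except possibly the last are completely full, which is what keeps the tree near complete.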