Data-driven indexing in GLAST

Marco Frailis
DC1 indexing at GSFC

Space-driven indexing
Use of a level-2 HTM pixelization

The sky is divided into 128 regions

Each file is identified by: HTM id + min time + max time
Search performed by 5 parallel processes

Each process sorts the resulting photons by time

The 5 sets are merged into a FITS file
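
As a rough illustration of the scheme on this slide, the sketch below counts the regions at a given HTM level (8 base triangles, each split into 4 per level) and merges per-process photon lists that are already sorted by time. The `Photon` tuple, the sample ids and times, and the function names are hypothetical and are not taken from the DC1 code.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Photon(NamedTuple):
    time: float   # arrival time (hypothetical units)
    htm_id: int   # hypothetical id of the HTM region containing the photon


def htm_region_count(level: int) -> int:
    """Number of regions at an HTM level: 8 base triangles, 4 children each."""
    return 8 * 4 ** level


def merge_by_time(sorted_sets: Iterable[List[Photon]]) -> List[Photon]:
    """Merge time-sorted photon lists (one per search process) into one list."""
    return list(heapq.merge(*sorted_sets, key=lambda p: p.time))


# Two of the per-process result sets, each already sorted by time.
set_a = [Photon(1.0, 17), Photon(4.0, 17)]
set_b = [Photon(2.5, 42), Photon(3.0, 42)]
merged = merge_by_time([set_a, set_b])
```

Because every input list is already sorted, the merge only needs one pass over the data instead of a second full sort.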
VAMSplit R-Tree


A data-driven indexing method

follows the data distribution

indexes multidimensional data efficiently
An optimized and static R-tree structure

Data is bulk loaded into the tree

At each recursive step of the construction:

the dimension with maximum variance is calculated

the split is performed at a near median element
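
A minimal in-memory sketch of one construction step, assuming points are plain tuples of floats. The function names are hypothetical, and a full sort stands in for the selection step purely for brevity.

```python
from statistics import pvariance
from typing import List, Sequence, Tuple

Point = Sequence[float]


def max_variance_dimension(points: List[Point]) -> int:
    """Index of the coordinate whose values have the largest variance."""
    dims = len(points[0])
    return max(range(dims), key=lambda d: pvariance([p[d] for p in points]))


def split_at(points: List[Point], dim: int, m: int) -> Tuple[List[Point], List[Point]]:
    """Split so that the first part holds the m smallest points along dim.

    A full sort is used for brevity; a selection algorithm would suffice.
    """
    ordered = sorted(points, key=lambda p: p[dim])
    return ordered[:m], ordered[m:]


points = [(0.0, 10.0), (1.0, 50.0), (2.0, 20.0), (3.0, 90.0)]
dim = max_variance_dimension(points)
left, right = split_at(points, dim, len(points) // 2)
```

In a recursive build, the same step would then be applied to `left` and `right` until each part fits into a leaf.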
The R-tree (simplified)

Inner node format: (cp, Rectangle)

cp: the address of a child node in the R-tree

Rectangle: d-dimensional Minimum Bounding Box of all the rectangle entries in that child node


Leaf node format: (Point, Attributes)

Point: coordinates in a d-dimensional space

Attributes: data associated with that point

The maximum number of entries in each node is limited by the disk page size
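
The node layout above could be modelled roughly as follows. All class and function names are hypothetical. The 8192-byte page is an assumed value, and the capacity estimate ignores any per-node header.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Rectangle:
    low: Tuple[float, ...]    # minimum coordinate in each dimension
    high: Tuple[float, ...]   # maximum coordinate in each dimension

    def contains(self, point: Sequence[float]) -> bool:
        return all(lo <= x <= hi for lo, x, hi in zip(self.low, point, self.high))


@dataclass
class InnerEntry:
    child: int            # cp: address (e.g. page number) of the child node
    rectangle: Rectangle  # minimum bounding box of the child's entries


@dataclass
class LeafEntry:
    point: Tuple[float, ...]  # coordinates in the d-dimensional space
    attributes: bytes         # data associated with the point


def bounding_box(points: List[Sequence[float]]) -> Rectangle:
    """Smallest axis-aligned box enclosing all the given points."""
    return Rectangle(tuple(map(min, zip(*points))), tuple(map(max, zip(*points))))


def max_entries(page_size: int, entry_size: int) -> int:
    """Entries that fit in one disk page (node header ignored)."""
    return page_size // entry_size


box = bounding_box([(1.0, 5.0), (3.0, 2.0), (2.0, 4.0)])
leaf_capacity = max_entries(8192, 165)  # assumed page size, 165-byte entries
```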
VAMSplit construction (1)

A nearly complete, balanced R-tree is built

At each recursive step, the capacity of the child subtrees is calculated by:

$$\mathrm{ccap} = B \cdot F^{\left\lceil \log_F \frac{N}{B} \right\rceil - 1}$$

where B is the bucket capacity, F the internal fanout and N the number of elements
The near median element is selected by:

$$\mathrm{nm} = \mathrm{ccap} \cdot \left\lfloor \frac{N}{\mathrm{ccap}} \cdot \frac{1}{2} \right\rfloor$$
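
A small worked sketch of the two quantities on this slide, using integer arithmetic only. It assumes the brackets in the formulas are a ceiling (for ccap) and a floor (for nm). The values of B and F are hypothetical. The fallback used when the floor evaluates to zero is also an assumption, since the slide does not say how that case is handled.

```python
def child_capacity(n: int, bucket: int, fanout: int) -> int:
    """ccap = B * F**(ceil(log_F(N / B)) - 1), computed with integers only."""
    if n <= bucket:
        raise ValueError("n fits in a single leaf; no split is needed")
    capacity = bucket  # capacity of a single leaf
    # Grow until one more level of fanout would hold all n elements.
    while capacity * fanout < n:
        capacity *= fanout
    return capacity


def near_median(n: int, ccap: int) -> int:
    """nm = ccap * floor(N / (2 * ccap)), a multiple of the child capacity."""
    nm = ccap * (n // (2 * ccap))
    # Assumption: when N < 2 * ccap the floor gives 0, so fill one subtree.
    return nm if nm > 0 else ccap


n_photons = 40_100_000  # one year of simulated data
ccap = child_capacity(n_photons, bucket=100, fanout=10)  # B, F hypothetical
nm = near_median(n_photons, ccap)
```

Splitting at a multiple of ccap keeps the subtrees on the left side completely full, which is what makes the resulting tree nearly complete.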
VAMSplit construction (2)

The near median element is selected on the dimension with maximum variance

a better split strategy than selecting the dimension with maximum spread

For large datasets, an external selection algorithm is needed

pivot value found by a sampling method

caching method used for the partition step
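
The idea of selection with a sampled pivot can be sketched in memory as below. This is only an illustration: the external, disk-based algorithm and its caching of the partition step are not modelled, and the function names, sample size and seed are hypothetical.

```python
import random
from typing import List


def sampled_pivot(values: List[float], sample_size: int, rng: random.Random) -> float:
    """Estimate a pivot as the median of a small random sample."""
    sample = rng.sample(values, min(sample_size, len(values)))
    return sorted(sample)[len(sample) // 2]


def select(values: List[float], k: int, sample_size: int = 5, seed: int = 0) -> float:
    """Return the k-th smallest value (0-based) without sorting everything."""
    rng = random.Random(seed)
    while True:
        pivot = sampled_pivot(values, sample_size, rng)
        smaller = [v for v in values if v < pivot]
        equal = [v for v in values if v == pivot]
        if k < len(smaller):
            values = smaller               # answer lies below the pivot
        elif k < len(smaller) + len(equal):
            return pivot                   # the pivot is the k-th element
        else:
            k -= len(smaller) + len(equal)
            values = [v for v in values if v > pivot]


data = [9.0, 1.0, 7.0, 3.0, 5.0, 8.0, 2.0]
median = select(data, len(data) // 2)
```

In the tree construction, `k` would be the near median position along the chosen dimension.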
Photon data index

Detail on the Crab

Preliminary comparison





25 queries performed on one year of simulated data (40.1 million good photons)
165 bytes for each photon (instead of 90)
SCSI vs IDE disk
Circular vs rectangular search
DC1 index search time (one process):

30.86 s without sorting
76.56 s including sorting
VAMSplit R-Tree:

141.09 s (rough estimate, measured on different machines)

Future work

Improve the data structure to help browsing and analyzing the data

Optimize the code
Different split strategies
Other node information: exact number of elements in its subtrees, variance, other moments

Add other dimensions to the index

i.e. time and energy (not possible with the HTM)

Possible gamma-ray burst application