Triggering events with GPGPU in ATLAS L. Rinaldi (Università di Bologna and INFN) Institutes and people • • • • • • • • • SMU, Dallas, TX, USA INESC-ID, Universidade de Lisboa, Portugal LIP, Lisboa, Portugal STFC and Rutherford Appleton Lab, GB Sapienza Universita` di Roma, Italy AGH Univ. of Science and Technology, Krakow, Poland Istituto Nazionale di Fisica Nucleare (INFN-Bologna), Italy Universita` di Bologna, Italy INFN-CNAF, Italy Italian people: L. Rinaldi (BO), M. Bauce (RM1), A. Messina (RM1), M. Negrini (BO), A. Sidoti (BO), S. Tupputi (CNAF) Not-ATLAS but helping a lot: F. Semeria (INFN-BO), M. Belgiovine (UNIBO) 27/05/15 L. Rinaldi – Triggering with GPU in ATLAS 1 The ATLAS trigger and DAQ • Composed of hardware based Level-1 (L1) and software based High Level Trigger(HLT) • Reduces 40MHz input event rate to about 1kHz ( ~1500MB/s output) • L1 identifies interesting activity in small geometrical regions of detector called Region of Interests (RoI) • RoIs identified by L1 are passed to the HLT for event selection 27/05/15 With about 25k processes ~25k/100kHz = ~250ms/Event decision time L. Rinaldi – Trigger con GPU in ATLAS 2 High Level Trigger (HLT) • • • • • • • HLT uses Stepwise processing of sequences called Trigger Chains(TC) Each chain is composed of Algorithms and seeded by RoIs from L1 Same RoI may seed multiple chains Same Algorithm may run on different data An algorithm runs only once on same data (caching) If a RoI fails a selection step further algorithms in chain are not executed Initial algorithms (L2) work on partial event data (2-6%). Later algorithms have access to full event data 27/05/15 L1 ROI L1 ROI L1 ROI L2 Alg L2 Alg L2 Alg L2 Alg L2 Alg L2 Alg L2 Alg L2 Alg L2 Alg L2 Alg L2 Alg L2 Alg L2 Alg L2 Alg L2 Alg L2 Alg Event Building EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg EF Alg L. Rinaldi – Trigger con GPU in ATLAS 3 The ATLAS Detector Run- 1 CPU measurements Combined HLT: • 60% Tracking • 20% Calo • 10% Muon • 10% other ATLAS is developing and expanding a demonstrator exploiting GPUs in: • Inner Detector (ID)-Tracking • Calorimeter-Topo-clustering • Muon-Tracking to evaluate the potential benefit of GPUs in terms of throughput per unit cost 27/05/15 L. Rinaldi – Trigger con GPU in ATLAS 4 ATLAS GPU Demonstrator • Increasing LHC instantaneous luminosity is leading to higher pileup • CPU time rises rapidly with pile-up due to combinatorial nature of the tracking • HLT farm size is limited, mainly by power and cooling • GPUs provide good power/compute ratio 27/05/15 L. Rinaldi – Trigger con GPU in ATLAS 5 Offloading Mechanism • Trigger Processing Units integrate ATLAS offline software framework, ATHENA, to online environment • Many PU processes run on each Trigger host • A client-server approach is implemented to manage resources between multiple PU processes. • PU prepares data to be processed and sends it to server • Accelerator Process Extension (APE) Server manages offload requests and executes kernels on GPU(s) • It sends results back to process that made the offload request 27/05/15 Trigger Trigger Trigger PU PU Trigger PU (ATHEN (ATHEN PUA) (ATHEN A) ) (ATHENA A) Data+Metadata Results+Metada ta APE Server Server can support different hardware types (GPUs, Xeon-Phi, CPUs) and different configurations such as GPU/Phi mixtures and in-host off-host accelerators. L. Rinaldi – Trigger con GPU in ATLAS 6 Offloading - Client 3. 4. 5. 27/05/15 HLT Algoritm 1 Athena EDM 5 Data+MetaData TrigDetAccelSvc OffloadSvc 3 2 GPU EDM Export Tool Export Tool Export tools and server communication need to be fast (serial sections) L. Rinaldi – Trigger con GPU in ATLAS 4 Result 2. HLT Algorithm asks for offload to TrigDetAccelSvc TrigDetAccelSvc converts C++ classes for raw and reconstructed quantities from the Athena Event Data Model (EDM) to GPU optimized EDM through Data Export Tools Then it adds metadata and requests offload through OffloadSvc OffloadSvc manages multiple requests and communication with APE server Results are converted back to Athena EDM by TrigDetAccelSvc and handed to requesting algorithm Request 1. Implemented for each detector such as TrigInDetAccelSvc APE Server 7 Offloading - Server • • • • APE Server uses plug-in mechanism It is composed of Manager handles communication with processes and scheduling – Receives offload requests – Passes the request to appropriate Module – Executes Work items – Sends the results back to requesting process Modules manage GPU resources and create Work items – Manage constants and time-varying data on GPU – Bunch multiple requests to optimize utilization – Create appropriate work items Work items are composed of the kernels which run on GPU and prepare results Athena Data + Metadata Work Module Todo Queue APE Server Manager Complete d Queue Athena Results + Metadata Each detector implements their own Module 27/05/15 L. Rinaldi – Trigger con GPU in ATLAS 8 ID tracking Module Tracking is most time consuming part of trigger, done in four steps ByteStream Decoding Hit Clustering Track Formation Clone Removal Each thread works on a word and decodes it. Data contains hits on on different ID layers Bytestream decoding converts detector encoded output to hits coordinates Charged particles typically activate one or more neighboring strips or pixels. Hit clustering merges these by a Cellular Automata algorithm 27/05/15 Each hit start as an independent cluster. Adjacent clusters are merged in each step until all adjacent cells belong to same cluster. Each thread works on a different hit L. Rinaldi – Trigger con GPU in ATLAS 9 ID tracking Module Tracking starts with pair forming in inner layers. A 2D thread array checks for pairing condition and selects suitable pairs Then these pairs are extrapolated to outer layers by a 2D thread block to form triplets. Finally triplets are combined to form Track Candidates. In clone removal step, track candidates starting from different pairs but having same outer layer hits are merged to form tracks 27/05/15 L. Rinaldi – Trigger con GPU in ATLAS 10 ID tracking speedup In a previous Demonstrator, only the first two step implemented • Raw detector data decoding and clusterization algorithms form the first step of reconstruction (data preparation) • Initial tests with Monte-Carlo samples shown up to 21x speed up compared to serial Athena job. • ~12x speedup with partial reconstruction made on GPU OFFLOAD is working! 27/05/15 L. Rinaldi – Trigger con GPU in ATLAS 11 Calorimeter Topo-clustering • Topo-clustering algorithm classifies calorimeter cells by signal/noise – S/N>4 are Seeds – 4>S/N>2 are Growing cells – 2>S/N>0 are Terminal cells Standard algorithm • • Growing cells around Seeds are included until they don’t have any more Growing or Terminal cells around them Parallel implementation uses Cellular Automata algorithm to parallelize the task Parallel algorithm 1. Assign one thread for each cell 2. Start with adding Seeds to cluster. 3. At each iteration join to the cluster which contain highest S/N neighboring cell 4. Terminate when maximum radius is reached or can’t include anymore cells 27/05/15 L. Rinaldi – Trigger con GPU in ATLAS 12 Muons Several steps of reconstruction relying on many tools in Athena: 1. Find segment in the Muon Spectrometer (<< Hough Transform) 2. Build tracks in the MS 3. Extrapolate to primary vertex obtaining a Stand-Alone muon 4. Combine MS muons with ID tracks for a Combined muon 27/05/15 L. Rinaldi – Trigger con GPU in ATLAS 13 Previous HT-GPU study Circle detection using an algorithm based on Hough Transform (using GPGPU) Simulation of a typical central track detector: • Multi-layer cylindrical shape, with axis on the beamline and centered on the nominal interaction point • Uniform magnetic field with field lines parallel to the beamline • Charged particles will have helix trajectories (circles in the transverse plane wrt detector axis) • Charged particles HIT the detector • HITs position used to reconstruct the trajectories (this study done by Bologna people is NOT related to ATLAS GPU demonstrator) 27/05/15 L. Rinaldi – Trigger con GPU in ATLAS 14 HT execution time vs CPU GPUs vs CPU GPUs only C++ CUDA OpenCL GPU%CPU speed up over 15x CPU timing scales with number of hits GPU timing almost independent on number of hits This was a stand-alone simulation, need to take into account experimental framework interface and data format conversion 27/05/15 L. Rinaldi – Trigger con GPU in ATLAS 15 Summary and conclusions • Increasing instantaneous luminosity of LHC necessitates parallel processing • Massive parallelization of trigger algorithms on GPU is being investigated as a way to increase the computepower of the HLT farm • ATLAS developed Accelerator Process Extension (APE) framework to manage offloading from Trigger processes • Inner Detector Trigger Tracking algorithms have been successfully offloaded to GPU, resulting in ~12x speedup compared to CPU implementations (21x speedup for data preparation) • Further algorithms for ID, Calorimeter and Muon systems are being implemented (speedup increase expected) • Studies on Hough Transform algorithm execution on GPU promise good speed-up 27/05/15 L. Rinaldi – Trigger con GPU in ATLAS 16 backup 27/05/15 L. Rinaldi – Trigger con GPU in ATLAS 17 27/05/15 L. Rinaldi – Trigger con GPU in ATLAS 18 CUDA 6.0 coding Hough Matrix filling (Vote): • 1D grid over hits ( grid dimension = number of hits) • At a given (f, q), threadblock over A . For each A, a corresponding B is evaluated • The MH(f, q, A, B) Hough-Matrix element is incremented by a unity with CUDA atomicAdd() • Matrix initialization once at first iteration with cudaMallocHost (pinned memory) and initialized on device with cudaMemset 10/09/14 L. Rinaldi - GPGPU for track finding and triggering in High Energy Physics 19 CUDA 6.0 coding Local Maxima search • 2D-grid over (f, q) Grid dimension: Nf×(Nq*NA*NB/maxThreadsPerBlock) • 2D-threadblock, with dimXBlock= NA, dimYBlock=maxThreadsPerBlock/NA • Each thread compares the MH(f, q, A, B) element to neighbours, the bigger is stored in the GPU shared memory and eventually transferred back. • I/O demanding – several kernel may access matrix together 10/09/14 L. Rinaldi - GPGPU for track finding and triggering in High Energy Physics 20
© Copyright 2025 Paperzz