
Triggering events
with GPGPU in ATLAS
L. Rinaldi
(Università di Bologna and INFN)
Institutes and people
• SMU, Dallas, TX, USA
• INESC-ID, Universidade de Lisboa, Portugal
• LIP, Lisboa, Portugal
• STFC and Rutherford Appleton Lab, UK
• Sapienza Università di Roma, Italy
• AGH Univ. of Science and Technology, Krakow, Poland
• Istituto Nazionale di Fisica Nucleare (INFN-Bologna), Italy
• Università di Bologna, Italy
• INFN-CNAF, Italy
Italian people:
L. Rinaldi (BO), M. Bauce (RM1), A. Messina (RM1), M. Negrini (BO), A. Sidoti
(BO), S. Tupputi (CNAF)
Not in ATLAS but helping a lot: F. Semeria (INFN-BO), M. Belgiovine (UNIBO)
The ATLAS trigger and DAQ
• Composed of a hardware-based Level-1 (L1) trigger and a software-based High Level Trigger (HLT)
• Reduces the 40 MHz input event rate to about 1 kHz (~1500 MB/s output)
• L1 identifies interesting activity in small geometrical regions of the detector called Regions of Interest (RoIs)
• RoIs identified by L1 are passed to the HLT for event selection
With about 25k HLT processes and a 100 kHz L1 output rate: ~25k / 100 kHz ≈ 250 ms decision time per event
High Level Trigger (HLT)
• The HLT uses stepwise processing of sequences called Trigger Chains (TC)
• Each chain is composed of Algorithms and is seeded by RoIs from L1
• The same RoI may seed multiple chains
• The same Algorithm may run on different data
• An algorithm runs only once on the same data (caching)
• If a RoI fails a selection step, further algorithms in the chain are not executed
• Initial algorithms (L2) work on partial event data (2-6%); later algorithms have access to the full event data
[Diagram of a trigger chain: L1 RoIs seed L2 algorithms on partial event data; after Event Building, Event Filter (EF) algorithms run on the full event]
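To make the caching and early-rejection logic above concrete, here is a minimal host-side sketch (illustration only, not the ATLAS HLT steering code); every type and function name is hypothetical:

// Hypothetical sketch of trigger-chain execution with per-(algorithm, RoI)
// result caching and early rejection (not the actual ATLAS steering code).
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct RoI { int id; };                         // region of interest from L1
using AlgResult = bool;                         // pass/fail of a selection step
using Algorithm = std::function<AlgResult(const RoI&)>;

struct ChainStep { std::string algName; Algorithm alg; };

// Cache keyed by (algorithm name, RoI id): an algorithm runs only once on the same data.
std::map<std::pair<std::string, int>, AlgResult> cache;

bool runChain(const std::vector<ChainStep>& chain, const RoI& roi) {
    for (const auto& step : chain) {
        auto key = std::make_pair(step.algName, roi.id);
        auto it = cache.find(key);
        AlgResult passed = (it != cache.end()) ? it->second            // cached result
                                               : (cache[key] = step.alg(roi));
        if (!passed) return false;              // RoI fails the step: stop the chain early
    }
    return true;                                // all steps of the chain passed
}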
The ATLAS Detector
Run-1 CPU measurements
Combined HLT:
• 60% Tracking
• 20% Calo
• 10% Muon
• 10% other
ATLAS is developing and
expanding a demonstrator
exploiting GPUs in:
• Inner Detector (ID) tracking
• Calorimeter topo-clustering
• Muon tracking
to evaluate the potential benefit of
GPUs in terms of throughput per
unit cost
ATLAS GPU Demonstrator
• The increasing LHC instantaneous luminosity leads to higher pile-up
• CPU time rises rapidly with pile-up due to the combinatorial nature of tracking
• The HLT farm size is limited, mainly by power and cooling
• GPUs provide a good compute-to-power ratio
Offloading Mechanism
• Trigger Processing Units (PUs) integrate the ATLAS offline software framework, Athena, into the online environment
• Many PU processes run on each trigger host
• A client-server approach is implemented to manage the resources shared between multiple PU processes
• A PU prepares the data to be processed and sends it to the server
• The Accelerator Process Extension (APE) Server manages the offload requests and executes the kernels on the GPU(s)
• It sends the results back to the process that made the offload request
[Diagram: multiple Trigger PUs (Athena) send Data+Metadata to the APE Server and receive Results+Metadata]
The server can support different hardware types (GPUs, Xeon Phi, CPUs) and different configurations, such as GPU/Phi mixtures and in-host or off-host accelerators.
Offloading - Client
[Diagram: HLT Algorithm ↔ TrigDetAccelSvc, which uses Export Tools (Athena EDM → GPU EDM) and OffloadSvc to exchange Request/Data+Metadata and Result with the APE Server]
1. The HLT Algorithm asks TrigDetAccelSvc for offload
2. TrigDetAccelSvc converts the C++ classes for raw and reconstructed quantities from the Athena Event Data Model (EDM) to a GPU-optimized EDM through Data Export Tools
3. It then adds metadata and requests the offload through OffloadSvc
4. OffloadSvc manages multiple requests and the communication with the APE server
5. The results are converted back to the Athena EDM by TrigDetAccelSvc and handed to the requesting algorithm
The service is implemented for each detector, e.g. TrigInDetAccelSvc.
Export tools and server communication need to be fast (serial sections).
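For illustration, a rough host-side sketch of this client flow; the real TrigDetAccelSvc, OffloadSvc and APE interfaces differ, and every name and type below is hypothetical:

// Hypothetical client-side offload flow (illustration only).
#include <cstdint>
#include <vector>

struct AthenaHits   { std::vector<float> x, y, z; };      // stand-in for Athena EDM objects
struct GpuHitBuffer { std::vector<uint32_t> words; };     // flat, GPU-friendly layout
struct Metadata     { int roiId = 0; };
struct GpuTracks    { std::vector<int> trackLabels; };    // stand-in for the GPU-EDM result

// Export tool: convert Athena EDM to the GPU-optimized EDM (serial section, must be fast).
GpuHitBuffer exportToGpuEdm(const AthenaHits& hits) {
    GpuHitBuffer buf;
    buf.words.resize(hits.x.size());                      // e.g. one packed word per hit
    return buf;
}

// Stand-in for OffloadSvc: ship data+metadata to the APE server, get results back.
GpuTracks offloadToApeServer(const GpuHitBuffer& data, const Metadata&) {
    return GpuTracks{std::vector<int>(data.words.size(), -1)};   // placeholder answer
}

// Convert the GPU-EDM result back to Athena EDM for the requesting HLT algorithm.
std::vector<int> importFromGpuEdm(const GpuTracks& res) { return res.trackLabels; }

std::vector<int> processRoI(const AthenaHits& hits, const Metadata& meta) {
    GpuHitBuffer data = exportToGpuEdm(hits);             // step 2: Export Tools
    GpuTracks    res  = offloadToApeServer(data, meta);   // steps 3-4: OffloadSvc + APE server
    return importFromGpuEdm(res);                         // step 5: back to Athena EDM
}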
Offloading - Server
• The APE Server uses a plug-in mechanism and is composed of a Manager, Modules and Work items
• The Manager handles the communication with the processes and the scheduling
  – Receives offload requests
  – Passes each request to the appropriate Module
  – Executes Work items
  – Sends the results back to the requesting process
• Modules manage GPU resources and create Work items
  – Manage constants and time-varying data on the GPU
  – Bunch multiple requests together to optimize utilization
  – Create the appropriate Work items
• Work items are composed of the kernels which run on the GPU and prepare the results
[Diagram: Athena sends Data+Metadata to the APE Server; the Manager puts Work items created by the Module into a to-do queue, completed Work items go to a completed queue, and Results+Metadata are returned to Athena]
Each detector implements its own Module.
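A minimal sketch of the Manager / Module / Work-item pattern described above (illustration only, not the actual APE server code; all names are hypothetical):

// Hypothetical Manager/Module/Work-item skeleton.
#include <deque>
#include <memory>
#include <utility>
#include <vector>

struct Request { std::vector<float> data; };               // data + metadata from a PU
struct Result  { std::vector<float> data; };

// A Work item wraps the GPU kernels for one request and prepares the result.
struct WorkItem {
    Request request;
    Result run() { return Result{request.data}; }           // placeholder for kernel launches
};

// A Module manages GPU resources for one detector and creates Work items.
struct Module {
    std::unique_ptr<WorkItem> createWork(Request req) {
        return std::make_unique<WorkItem>(WorkItem{std::move(req)});
    }
};

// The Manager schedules: receive request -> route to Module -> execute -> return results.
struct Manager {
    Module idModule;                                         // e.g. one Module per detector
    std::deque<std::unique_ptr<WorkItem>> todo;              // to-do queue

    void receive(Request req) { todo.push_back(idModule.createWork(std::move(req))); }

    std::vector<Result> processAll() {                       // execute queued Work items
        std::vector<Result> completed;
        while (!todo.empty()) {
            completed.push_back(todo.front()->run());
            todo.pop_front();
        }
        return completed;                                    // sent back to the requesting PUs
    }
};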
ID tracking Module
Tracking is the most time-consuming part of the trigger and is done in four steps:
ByteStream Decoding → Hit Clustering → Track Formation → Clone Removal
Bytestream decoding converts the detector-encoded output into hit coordinates. Each thread works on one data word and decodes it; the data contains hits on different ID layers.
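A minimal CUDA sketch of this one-thread-per-word decoding pattern; the packed word layout and array names are invented for illustration:

// Illustration only: one thread decodes one bytestream word into a hit.
// The packed word layout (layer/row/column bit fields) is a made-up example.
__global__ void decodeWords(const unsigned int* words, int nWords,
                            int* layer, int* row, int* col)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nWords) return;                 // guard for the last partial block

    unsigned int w = words[i];
    layer[i] = (w >> 24) & 0xFF;             // hypothetical bit fields
    row[i]   = (w >> 12) & 0xFFF;
    col[i]   =  w        & 0xFFF;
}

// Launch: one thread per word.
// decodeWords<<<(nWords + 255) / 256, 256>>>(d_words, nWords, d_layer, d_row, d_col);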
Charged particles typically activate one or more neighbouring strips or pixels; hit clustering merges these with a cellular-automaton algorithm. Each hit starts as an independent cluster, and adjacent clusters are merged at each step until all adjacent cells belong to the same cluster. Each thread works on a different hit.
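A schematic CUDA version of one such cellular-automaton merging iteration; the brute-force adjacency loop is a simplification of the real algorithm:

// Illustration of one cellular-automaton iteration: every hit adopts the
// smallest cluster label among itself and its adjacent hits.  Repeating the
// kernel until no label changes leaves one label per connected cluster.
__global__ void mergeStep(const int* row, const int* col, int nHits,
                          const int* labelIn, int* labelOut, int* changed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per hit
    if (i >= nHits) return;

    int best = labelIn[i];
    for (int j = 0; j < nHits; ++j) {
        bool adjacent = abs(row[i] - row[j]) <= 1 && abs(col[i] - col[j]) <= 1;
        if (adjacent && labelIn[j] < best) best = labelIn[j];
    }
    labelOut[i] = best;
    if (best != labelIn[i]) atomicExch(changed, 1);  // flag that another pass is needed
}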
ID tracking Module
Track formation starts with pair forming in the inner layers: a 2D thread array checks the pairing condition and selects suitable pairs. These pairs are then extrapolated to the outer layers by a 2D thread block to form triplets, and the triplets are finally combined to form Track Candidates.
In the clone-removal step, track candidates starting from different pairs but sharing the same outer-layer hits are merged to form tracks.
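A toy CUDA sketch of the 2D-threaded pair forming, with one thread per combination of an inner-layer and a next-layer hit; the pairing cut shown is a placeholder, not the real selection:

// Illustration only: thread (i, j) tests inner-layer hit i against next-layer
// hit j.  The delta-phi / delta-z cut is a placeholder for the real condition.
__global__ void formPairs(const float* phi0, const float* z0, int n0,
                          const float* phi1, const float* z1, int n1,
                          int2* pairs, int* nPairs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // inner-layer hit index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // next-layer hit index
    if (i >= n0 || j >= n1) return;

    float dphi = fabsf(phi0[i] - phi1[j]);
    float dz   = fabsf(z0[i]  - z1[j]);
    if (dphi < 0.05f && dz < 10.0f) {                // placeholder pairing condition
        int k = atomicAdd(nPairs, 1);                // append the accepted pair
        pairs[k] = make_int2(i, j);
    }
}

// Launch with a 2D grid, e.g. dim3 block(16, 16); dim3 grid((n0+15)/16, (n1+15)/16);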
ID tracking speedup
In a previous demonstrator, only the first two steps were implemented:
• Raw detector data decoding and clusterization algorithms form the first step of the reconstruction (data preparation)
• Initial tests with Monte Carlo samples showed up to a 21x speed-up compared to a serial Athena job
• ~12x speed-up with the partial reconstruction performed on the GPU
OFFLOAD is working!
Calorimeter Topo-clustering
• The topo-clustering algorithm classifies calorimeter cells by signal-to-noise ratio (S/N):
  – S/N > 4: Seed cells
  – 4 > S/N > 2: Growing cells
  – 2 > S/N > 0: Terminal cells
Standard algorithm
• Growing cells around Seeds are included until they no longer have any Growing or Terminal cells around them
• The parallel implementation uses a cellular-automaton algorithm to parallelize this task
Parallel algorithm
1. Assign one thread to each cell
2. Start by adding the Seeds to clusters
3. At each iteration, join the cell to the cluster containing its highest-S/N neighbouring cell
4. Terminate when the maximum radius is reached or no more cells can be included
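A rough CUDA sketch of one growing iteration on a 2D grid of cells; the grid layout and the S/N threshold handling are simplifications of the real calorimeter geometry:

// Illustration of one growing iteration on a width x height grid of cells:
// an unassigned cell joins the cluster of its highest-S/N neighbour, provided
// that neighbour already belongs to a cluster and has S/N > 2 (Seed/Growing).
__global__ void growStep(const float* snr, const int* clusterIn, int* clusterOut,
                         int width, int height, int* changed)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;
    clusterOut[idx] = clusterIn[idx];
    if (clusterIn[idx] >= 0) return;                 // already assigned to a cluster

    float bestSnr = 2.0f;                            // neighbour must be Seed or Growing
    int   bestCluster = -1;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
            int nIdx = ny * width + nx;
            if (clusterIn[nIdx] >= 0 && snr[nIdx] > bestSnr) {
                bestSnr = snr[nIdx];
                bestCluster = clusterIn[nIdx];
            }
        }
    if (bestCluster >= 0) {
        clusterOut[idx] = bestCluster;               // join the highest-S/N neighbour's cluster
        atomicExch(changed, 1);                      // another iteration is needed
    }
}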
Muons
Several steps of reconstruction, relying on many tools in Athena:
1. Find segments in the Muon Spectrometer (MS) (via a Hough Transform)
2. Build tracks in the MS
3. Extrapolate to the primary vertex, obtaining a Stand-Alone muon
4. Combine MS muons with ID tracks to obtain a Combined muon
Previous HT-GPU study
Circle detection using an
algorithm based on Hough
Transform (using GPGPU)
Simulation of a typical central tracking detector:
• Multi-layer cylindrical shape, with the axis on the beamline and centered on the nominal interaction point
• Uniform magnetic field with field lines parallel to the beamline
• Charged particles follow helix trajectories (circles in the transverse plane with respect to the detector axis)
• Charged particles produce HITs in the detector
• The HIT positions are used to reconstruct the trajectories
(This study, done by the Bologna group, is NOT related to the ATLAS GPU demonstrator.)
HT execution time vs CPU
[Plots: HT execution time, GPUs vs CPU and GPUs only, for C++, CUDA and OpenCL implementations]
GPU vs CPU speed-up of over 15x.
CPU timing scales with the number of hits; GPU timing is almost independent of the number of hits.
This was a stand-alone simulation; the experimental framework interface and data format conversion still need to be taken into account.
Summary and conclusions
• The increasing instantaneous luminosity of the LHC necessitates parallel processing
• Massive parallelization of trigger algorithms on GPUs is being investigated as a way to increase the compute power of the HLT farm
• ATLAS developed the Accelerator Process Extension (APE) framework to manage offloading from trigger processes
• Inner Detector trigger tracking algorithms have been successfully offloaded to the GPU, resulting in a ~12x speed-up compared to CPU implementations (21x speed-up for data preparation)
• Further algorithms for the ID, Calorimeter and Muon systems are being implemented (speed-up increase expected)
• Studies of Hough Transform algorithm execution on GPUs promise good speed-ups
backup
CUDA 6.0 coding
Hough Matrix filling (Vote):
• 1D grid over hits (grid dimension = number of hits)
• At a given (f, q), a thread block runs over A; for each A, the corresponding B is evaluated
• The MH(f, q, A, B) Hough-matrix element is incremented by one with CUDA atomicAdd()
• The matrix is allocated once at the first iteration with cudaMallocHost (pinned memory) and initialized on the device with cudaMemset
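A schematic CUDA vote kernel, simplified to a two-parameter circle through the origin so that for a hit (x, y) each A fixes B via 2·A·x + 2·B·y = x^2 + y^2; the real vote over (f, q, A, B) is analogous but higher-dimensional:

// Schematic vote kernel (simplified circle model, not the full 4-parameter case):
// block index -> hit, thread index -> bin in A; the matching B bin is computed.
__global__ void houghVote(const float* hx, const float* hy,
                          unsigned int* MH, int nA, int nB,
                          float aMin, float aStep, float bMin, float bStep)
{
    int hit = blockIdx.x;                       // 1D grid over hits
    int ia  = threadIdx.x;                      // thread block over A bins
    if (ia >= nA) return;

    float x = hx[hit], y = hy[hit];
    if (fabsf(y) < 1e-3f) return;               // avoid division by ~0 in this toy model

    float A = aMin + ia * aStep;
    float B = (x * x + y * y - 2.0f * A * x) / (2.0f * y);   // corresponding B
    int ib = (int)((B - bMin) / bStep);
    if (ib < 0 || ib >= nB) return;

    atomicAdd(&MH[ia * nB + ib], 1u);           // one vote in the Hough matrix
}

// Launch: houghVote<<<nHits, nA>>>(d_x, d_y, d_MH, nA, nB, aMin, aStep, bMin, bStep);
// d_MH can be zeroed beforehand with cudaMemset, as on the slide.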
CUDA 6.0 coding
Local Maxima search
• 2D grid over (f, q); grid dimension: Nf × (Nq·NA·NB / maxThreadsPerBlock)
• 2D thread block, with dimXBlock = NA and dimYBlock = maxThreadsPerBlock / NA
• Each thread compares the MH(f, q, A, B) element to its neighbours; the larger value is stored in GPU shared memory and eventually transferred back
• I/O demanding – several kernels may access the matrix at the same time
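A simplified CUDA sketch of the local-maximum test, reduced to a 2D (A, B) slice and without the shared-memory staging used on the slide:

// Simplified local-maximum search on a 2D slice MH[ia*nB+ib] of the Hough matrix.
__global__ void findLocalMaxima(const unsigned int* MH, int nA, int nB,
                                unsigned int threshold, int2* maxima, int* nMaxima)
{
    int ia = blockIdx.x * blockDim.x + threadIdx.x;
    int ib = blockIdx.y * blockDim.y + threadIdx.y;
    if (ia >= nA || ib >= nB) return;

    unsigned int v = MH[ia * nB + ib];
    if (v < threshold) return;                           // ignore low-vote bins

    // Keep the bin only if no neighbour is strictly larger.
    for (int da = -1; da <= 1; ++da)
        for (int db = -1; db <= 1; ++db) {
            int na = ia + da, nb = ib + db;
            if (na < 0 || nb < 0 || na >= nA || nb >= nB) continue;
            if (MH[na * nB + nb] > v) return;
        }

    int k = atomicAdd(nMaxima, 1);                       // append the local maximum
    maxima[k] = make_int2(ia, ib);
}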