FPGA co-processor for the ALICE High Level Trigger

Gaute Grastveit
University of Bergen
Norway
H. Helstrup¹, J. Lien¹, V. Lindenstruth², C. Loizides⁵, D. Roehrich³, B. Skaali⁴,
T. Steinbeck², K. Ullaland³, A. Vestbo³, T. Vik⁴, A. Wiebalck²
for the ALICE Collaboration
¹ Bergen College, Norway
² Kirchhoff Institute for Physics, University of Heidelberg, Germany
³ Department of Physics, University of Bergen, Norway
⁴ Department of Physics, University of Oslo, Norway
⁵ Institute of Nuclear Physics, University of Frankfurt, Germany
ALICE - A Large Ion Collider Experiment
TPC - Time Projection Chamber
Very High Data Rate
Pb-Pb central collisions
Event rate: 200 Hz
Event size: ~75 Mbyte
=> 15 Gbyte/s
Max data rate to tape is 1.25 Gbyte/s
Compression/selection is needed
Conventional, lossless methods: factor 2
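The arithmetic behind these figures can be checked with a short sketch. It assumes the event size is meant in megabytes (which is what the 15 Gbyte/s total implies) and takes 1 Gbyte as 1000 Mbyte; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the data-rate figures on this slide.
EVENT_RATE_HZ = 200        # Pb-Pb central collisions
EVENT_SIZE_MB = 75         # approximate event size, read as megabytes (assumption)
TAPE_LIMIT_GB_S = 1.25     # maximum data rate to tape
LOSSLESS_FACTOR = 2        # conventional lossless compression

# Raw input rate, taking 1 Gbyte = 1000 Mbyte.
input_rate_gb_s = EVENT_RATE_HZ * EVENT_SIZE_MB / 1000

# Overall reduction needed to fit the tape bandwidth.
required_factor = input_rate_gb_s / TAPE_LIMIT_GB_S

# What is left after lossless compression alone.
after_lossless_gb_s = input_rate_gb_s / LOSSLESS_FACTOR

print(f"input rate:       {input_rate_gb_s} Gbyte/s")
print(f"required factor:  {required_factor}")
print(f"after lossless:   {after_lossless_gb_s} Gbyte/s")
```

Under these assumptions the required reduction is a factor 12, so lossless compression by a factor 2 still leaves 7.5 Gbyte/s, well above the tape limit.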
HLT functionality
• Compress
• Reduce the amount of data required to encode the event as far as possible without losing physics information
• Trigger
• Accept/reject events on the basis of physics application
• Select
• Select regions of interest within an event
• remove pile-up in p-p
• ...
Task: reconstruct the tracks of 20,000 charged particles (each producing 150 clusters) in the TPC
Time budget: 5 ms
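A rough sketch of what this task implies in numbers. It assumes the 200 Hz event rate quoted for central Pb-Pb collisions; the 5 ms budget appears consistent with one event per period at that rate, but that link is an inference, not a statement from the slides.

```python
TRACKS_PER_EVENT = 20_000      # charged particles to reconstruct
CLUSTERS_PER_TRACK = 150       # clusters produced by each particle
EVENT_RATE_HZ = 200            # assumed: central Pb-Pb event rate

clusters_per_event = TRACKS_PER_EVENT * CLUSTERS_PER_TRACK
period_ms = 1000 / EVENT_RATE_HZ                 # time between events
clusters_per_second = clusters_per_event * EVENT_RATE_HZ

print(clusters_per_event, period_ms, clusters_per_second)
```

So the whole system has to handle about three million clusters per event, or 600 million clusters per second on average.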
The HLT setup
Data are received in parallel
[Figure: readout chain from the TPC front-end electronics (FEE: ALTRO, buffer of 8 events) via the RCU and the DDL to the RORC in the HLT farm. Labels in the receiving node: RORC, RcvBd, receiver buffer (> 1000 events), PCI, NIC. Rates marked in the figure: 216 x 320 MB/s and 216 x 100 MB/s.]
RCU - Readout Controller Unit
DDL - Detector Data Link
RORC - ReadOut Receiver Card
•PCI kernel in the FPGA
•FPGA will also be utilised for pattern recognition
•Reduces number of CPUs needed
The HLT FPGA co-processor
• FPGA: APEX 20K400
• Next prototype: Altera Stratix FPGA
– Large internal memory
– DSP cores
Two Schemes for Finding Tracks
•Low occupancy (p-p, Pb-Pb outer padrows)
•Conventional approach with (2d) cluster finder and track follower
•High occupancy (overlapping clusters):
•Hough transform on raw data
•Cluster analysis for deconvolution
•(Kalman filter)
[Figure: high-multiplicity event]
Cluster Finder
[Figure: grid of ADC values, pad on the horizontal axis and time on the vertical axis]
The numbers represent charge (ADC values).
A vertical uninterrupted stack of numbers is called a sequence. The square shows the geometric centre of the sequence.
Neighbouring sequences belong to the same cluster.
Final mean value (weighted mean):
$$\text{mean} = \frac{\sum (\text{charge} \times \text{scalevalue})}{\sum \text{charge}}$$
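As a minimal illustration of the weighted mean, the sketch below computes it for one sequence. The function name and the sample values are made up for the example; "scale value" is taken here to be the time-bin number of each ADC sample.

```python
def weighted_mean(charges, scale_values):
    """Charge-weighted mean: sum(charge * scalevalue) / sum(charge)."""
    total = sum(charges)
    if total == 0:
        raise ValueError("sequence has no charge")
    return sum(q * s for q, s in zip(charges, scale_values)) / total


# One hypothetical sequence: five consecutive time bins on a single pad.
charges = [2, 9, 14, 7, 3]          # ADC values
time_bins = [10, 11, 12, 13, 14]    # scale values

print(weighted_mean(charges, time_bins))
```

The same function applies in the pad direction, with pad numbers as scale values and the summed charge of each sequence as weights.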
FPGA implementation of a cluster finder - the algorithm
• Calculate the mean for every sequence
• Sequences on adjacent pads with similar means are merged
• Two lists of sequences are used: one for clusters on the previous pad, one for clusters on the current pad
• Clusters are removed from the search range when a match is found or when we know the cluster is finished
• Clusters are inserted in the input range after merging or when we start a new cluster
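The steps above can be sketched in software as follows. This is a simplified model under stated assumptions, not the VHDL design: `MATCH_DISTANCE`, the dictionary fields and the input layout are hypothetical, and clusters left in the search range are only declared finished at the end of a pad, whereas the hardware may detect this earlier.

```python
MATCH_DISTANCE = 2.0  # hypothetical tolerance on the mean time, in time bins


def summarize(start_time, charges):
    """Return (total charge, charge-weighted time sum) of one sequence."""
    total = sum(charges)
    weighted = sum(q * (start_time + i) for i, q in enumerate(charges))
    return total, weighted


def find_clusters(pads):
    """pads: list of (pad_number, [(start_time, charges), ...]) in pad order.

    Returns (mean pad, mean time, total charge) for every cluster.
    """
    finished = []
    search_range = []       # clusters that reached the previous pad
    previous_pad = None
    for pad, sequences in pads:
        if previous_pad is not None and pad != previous_pad + 1:
            # A skipped pad: nothing in the search range can continue.
            finished.extend(search_range)
            search_range = []
        input_range = []    # clusters that reach the current pad
        for start_time, charges in sequences:
            total, weighted = summarize(start_time, charges)
            mean = weighted / total
            # Look for a cluster on the previous pad with a similar mean.
            index = next((i for i, c in enumerate(search_range)
                          if abs(c["last_mean"] - mean) <= MATCH_DISTANCE), None)
            if index is None:
                cluster = {"charge": 0, "time_sum": 0, "pad_sum": 0}
            else:
                cluster = search_range.pop(index)   # match found: remove it
            cluster["charge"] += total
            cluster["time_sum"] += weighted
            cluster["pad_sum"] += total * pad
            cluster["last_mean"] = mean
            input_range.append(cluster)             # insert after merging
        # Whatever is still in the search range did not continue: finished.
        finished.extend(search_range)
        search_range = input_range
        previous_pad = pad
    finished.extend(search_range)
    return [(c["pad_sum"] / c["charge"], c["time_sum"] / c["charge"], c["charge"])
            for c in finished]


# Hypothetical input: one cluster spanning pads 5-7, one isolated on pad 5.
pads = [
    (5, [(10, [2, 6, 2]), (30, [4, 4])]),
    (6, [(10, [5, 10, 5])]),
    (7, [(10, [2, 6, 2])]),
]
clusters = find_clusters(pads)
print(clusters)
```

Only running sums are stored per cluster, so the final weighted means need a single division per direction when the cluster is sent out.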
Memory of clusters
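[Figure: layout of the cluster memory. The search range (previous pad) lies between the begin and end pointers; the input range (current pad) follows it, and new entries are written at the insert pointer.]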
Block Diagram, Verification
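[Figure: testbench block diagram. A file of charges is read by both the VHDL testbench and a C++ model. Blocks in the VHDL top structure: RAM (lpm), Decoder, FIFO (lpm), Merger; sequences (seq) pass from the Decoder through the FIFO to the Merger, which outputs clusters. The VHDL clusters and the C++ clusters are written to separate files, and a C++ program compares the results.]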
Relative Scales
As before, the mean is calculated by:
$$\text{mean} = \frac{\sum (\text{charge} \times \text{scalevalue})}{\sum \text{charge}}$$
+ Smaller numbers, only multiplies by <11
- Multiplication can't be done until merging takes place
Alternative (absolute): Decoder -> FIFO (lpm) -> Pre_Calc (2 mult, 1 add) -> Merger
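A small sketch of why relative scale values give the same mean as absolute ones. It assumes that "relative" means positions counted from a reference such as the first bin of the sequence; that reading, and all names below, are assumptions made for illustration.

```python
import math


def mean_absolute(charges, positions):
    """Weighted mean using the absolute positions as scale values."""
    return sum(q * p for q, p in zip(charges, positions)) / sum(charges)


def mean_relative(charges, positions):
    """Weighted mean using small offsets from a reference position."""
    reference = positions[0]
    offsets = [p - reference for p in positions]   # small multipliers
    relative_mean = sum(q * o for q, o in zip(charges, offsets)) / sum(charges)
    return reference + relative_mean               # shift back at the end


charges = [1, 4, 3]
positions = [400, 401, 402]

print(mean_absolute(charges, positions), mean_relative(charges, positions))
```

The multiplications in `mean_relative` only involve the small offsets, which is the advantage listed above; the price is that the reference has to be known before the products can be formed.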
Deconvolution
Simplified implementation, almost for free - splits at minima in both directions (time and pad)
[Figure: cluster finder result with deconvolution off and on]
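Splitting at minima can be illustrated in one dimension. The sketch below cuts a charge profile at every strict local minimum and assigns the minimum bin to the piece that follows it; both choices are assumptions made for the example, and plateaus are not treated.

```python
def split_at_minima(charges):
    """Split a 1D charge profile at strict local minima."""
    pieces = []
    start = 0
    for i in range(1, len(charges) - 1):
        if charges[i] < charges[i - 1] and charges[i] < charges[i + 1]:
            pieces.append(charges[start:i])   # close the piece before the minimum
            start = i                         # the minimum opens the next piece
    pieces.append(charges[start:])
    return pieces


print(split_at_minima([2, 8, 3, 9, 4]))   # two overlapping peaks
print(split_at_minima([1, 5, 9, 5, 1]))   # a single peak stays whole
```

The same test can be applied in the time direction on the ADC values of a sequence and in the pad direction on the summed charges of neighbouring sequences.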
Merger Goals
•spend few clock cycles per sequence
•use few logic elements
•high clockspeed
[Figure: merger state machine, annotated with the clock cycles spent in the different states. States: idle, calc_dist, merge_mult, merge_add, merge_store, insert_seq, send_one, send_many, send_all. Transition labels: new data, next pad, new row or skip pad, new search range, empty, old is above, old is below, within match distance. The idle state accounts for 30% of the clock cycles; the other shares shown are 22%, 11%, 11%, 11%, 6%, 5%, 4% and 0%.]
Cluster Finder Performance
•Synthesized on Altera APEX
•Uses 1800 Logic Elements (11%)
•Memory usage 16*80 + 64*112 = 8448 bits (4%)
•Circuit runs at 33 MHz
Outlook
Implementation of Hough transformation
[Figure: block diagram of the planned Hough transform firmware. Detector Data Link -> Data Format Decoder (back linked list of ALTRO sequences; TPC coordinates: padrow, pad, time) -> XYZ Transformer (local coordinates X, Y, Z) -> ABE Transformer (A, B, E) -> parameter space (k, phi, eta-index), filled into Histogram 1 ... Histogram N -> Find Maxima. The ADC count passes through a 10-to-8 bit converter.]
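To make the idea of histogramming in parameter space concrete, here is a software sketch of a Hough transform for a single eta slice. It assumes tracks that start at the origin, so that in the transverse plane a point at polar coordinates (r, phi) lies on a circle with curvature kappa and emission angle psi when kappa = 2 sin(phi - psi) / r. The binning, the parameter ranges and all names are hypothetical and are not taken from the firmware design.

```python
import math


def hough_transform(points, n_psi=80, n_kappa=40, psi_max=0.8, kappa_max=0.02):
    """Fill a (psi, kappa) histogram from (x, y, charge) points."""
    psi_width = 2 * psi_max / n_psi
    kappa_width = 2 * kappa_max / n_kappa
    histogram = [[0.0] * n_kappa for _ in range(n_psi)]
    for x, y, charge in points:
        r = math.hypot(x, y)
        phi = math.atan2(y, x)
        for i_psi in range(n_psi):
            psi = -psi_max + (i_psi + 0.5) * psi_width        # bin centre
            kappa = 2 * math.sin(phi - psi) / r
            if -kappa_max <= kappa < kappa_max:
                i_kappa = int((kappa + kappa_max) / kappa_width)
                histogram[i_psi][i_kappa] += charge           # weight by charge
    return histogram, psi_width, kappa_width


def find_maximum(histogram, psi_width, kappa_width, psi_max=0.8, kappa_max=0.02):
    """Return (psi, kappa, content) at the centre of the fullest bin."""
    best = (0, 0, -1.0)
    for i_psi, row in enumerate(histogram):
        for i_kappa, content in enumerate(row):
            if content > best[2]:
                best = (i_psi, i_kappa, content)
    psi = -psi_max + (best[0] + 0.5) * psi_width
    kappa = -kappa_max + (best[1] + 0.5) * kappa_width
    return psi, kappa, best[2]


# Ten hypothetical points on a circle through the origin.
TRUE_PSI, TRUE_KAPPA = 0.21, 0.0105
radius = 1 / TRUE_KAPPA
points = []
for step in range(1, 11):
    t = 0.1 * step   # arc angle along the circle
    x = radius * (math.sin(TRUE_PSI + t) - math.sin(TRUE_PSI))
    y = radius * (math.cos(TRUE_PSI) - math.cos(TRUE_PSI + t))
    points.append((x, y, 1.0))

histogram, psi_width, kappa_width = hough_transform(points)
best_psi, best_kappa, votes = find_maximum(histogram, psi_width, kappa_width)
print(best_psi, best_kappa, votes)
```

Every point votes once per psi bin, so the inner loop is independent for each point and each histogram, which is the property that makes the transform attractive for a parallel firmware implementation.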
Conclusion
We have demonstrated the feasibility of a real-time cluster finder implemented in an FPGA
Firmware implementation of a Hough transform looks promising
TPC - Time Projection Chamber
18 sectors on each side; each sector is read out in 6 subsectors
Total is ca. 570,000 pads