O2_SimulationInput_v2

Version 2
9 February 2017
The Input for CWG-3 Simulation Activities
This document describes the design of the computing systems and the input parameters and test-bed
measurements necessary for the O2 simulation activities.
The current main simulation tasks are:
1) The internal CRU – FLP data flow for continuous read-out mode and trigger mode
2) The FLP – EPN switching fabric for building the full time frames.
3) The simulation of the O2 “offline” workflow.
1. The CRU – FLP Data Flow:
The O2 DAQ system will have to operate in two main modes: trigger mode and continuous mode.
Switching between the two operational modes should be as simple as changing a configuration parameter;
a minimal sketch of such a parameter is given after Figure 2. The logical diagrams and the main blocks
involved in processing data in these two modes are presented in Figure 1 (trigger mode) and Figure 2
(continuous mode).
Figure 1 The Trigger mode operation for the CRU
Figure 2 The logical flow diagram for the continuous mode operation of the CRU
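As an illustration only, the mode switch could enter the simulation as a single configuration field. The class and parameter names below are hypothetical placeholders, not part of the O2 design; the numeric defaults are taken from the parameter list in this section.

from enum import Enum
from dataclasses import dataclass


class ReadoutMode(Enum):
    TRIGGERED = "triggered"    # data flow driven by trigger signals
    CONTINUOUS = "continuous"  # data flow driven by heartbeat (HB) signals


@dataclass
class CruSimConfig:
    # Hypothetical simulation configuration; values are placeholders.
    mode: ReadoutMode = ReadoutMode.CONTINUOUS
    n_input_links: int = 40          # input links per CRU (assumed)
    link_rate_gbps: float = 4.0      # per-link input rate
    cru_buffer_kb: int = 8           # small internal buffer (4 or 8 KB)


# Switching between the two operational modes is then a one-line change:
config = CruSimConfig(mode=ReadoutMode.TRIGGERED)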
The schematic view of the CRU logical block diagram to be used for the simulation is presented in Figure 3.
The data processing units (BC – baseline correction, ZS – zero suppression, CF – cluster finding) can be
added to or removed from the data flow. Data processing in the CRU is done synchronously with the
input rate. Small data buffers (4 KB or 8 KB) are used to store the data before sending them via the PCIe
bus to the memory of the FLP node.
Figure 3 The CRU – FLP schematic view for the data flow simulation. The modules BC, ZS, and CF can be added to or removed from the
data flow in the simulation.
The data flow is managed by the HB (heartbeat) signal in continuous mode and by the trigger signal in trigger
mode. Figure 4 presents how these signals are distributed to the CRU units and then to the FLP
nodes. Each FLP node will have two CRU units connected via PCIe x16.
Figure 4 The control signaling for the DAQ – CRU system
The main parameters for the simulation of the CRU – FLP data flow:
- Input rate: up to 40 GBT links running at 4 Gbps
- Data rate reduction in continuous mode by ZS and CF: factor 3 to 5
- Maximum memory available to buffer data in the CRU: 1 MB
- PCIe availability: how the PCIe bus is available to the data transfer unit in the CRU. We assume that we can use ~90% of the PCIe bus capacity. The periods when the PCIe bus is used by the OS are being measured, and their probability distribution in terms of lengths and rate will be used in the simulation (preliminary probability distributions of the time lengths for which the PCIe bus is used by the OS are shown in Figure 5). A minimal sketch of such a buffer and PCIe availability model is given after this list.
- The latency of the BC, ZS and CF processing in the CRU
- The control signal latency:
  - 1 microsecond to send the busy signal from the CTP to the CRU
  - 4 microseconds to send the full signal from the CRU to the CTP
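The sketch below is a toy Monte Carlo of the CRU buffer under intermittent PCIe availability, assuming the numbers from the list above (40 links at 4 Gbps, reduction factor 4, 1 MB buffer, ~90% PCIe duty cycle). The OS busy-period model (rate and exponential length) is a placeholder; in the real study it would be replaced by the measured distributions of Figure 5.

import random

STEP_US = 1.0                       # simulation time step [microseconds]
INPUT_GBPS = 40 * 4.0               # 40 input links at 4 Gbps (assumed)
REDUCTION = 4.0                     # ZS + CF reduction factor (3 to 5)
PCIE_GBPS = INPUT_GBPS / REDUCTION / 0.9   # link sized for ~90% duty cycle
BUFFER_LIMIT_BITS = 1_000_000 * 8   # 1 MB CRU buffer

def run(n_steps=1_000_000, mean_busy_us=5.0, busy_prob_per_us=0.02):
    """Return the fraction of steps in which the 1 MB buffer overflows."""
    buffer_bits, busy_left, overflows = 0.0, 0.0, 0
    for _ in range(n_steps):
        # Data produced after ZS/CF reduction during this step (Gbps -> bits/us).
        buffer_bits += INPUT_GBPS / REDUCTION * 1e3 * STEP_US
        # PCIe occasionally taken by the OS for a random-length interval
        # (placeholder model; measured distributions would go here).
        if busy_left <= 0 and random.random() < busy_prob_per_us * STEP_US:
            busy_left = random.expovariate(1.0 / mean_busy_us)
        if busy_left > 0:
            busy_left -= STEP_US
        else:
            buffer_bits = max(0.0, buffer_bits - PCIE_GBPS * 1e3 * STEP_US)
        if buffer_bits > BUFFER_LIMIT_BITS:
            overflows += 1
            buffer_bits = BUFFER_LIMIT_BITS   # data beyond the buffer is lost
    return overflows / n_steps

print(f"overflow fraction: {run():.2e}")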
The simulation studies should provide:
- A precise estimate of the probability of transferring all the data to the FLP memory for different assumptions on the data reduction rate and PCIe usage. As there are ~500 CRU units which work completely independently, the probability of having an incomplete time frame needs to be correctly estimated (a back-of-the-envelope illustration of this scaling follows this list).
- Help in the design of the control signaling mechanism for trigger mode and continuous mode, and the addition of this information to the FLP control system.
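To see why the per-CRU estimate must be precise, assume (illustratively, not from the source) that each of the ~500 independent CRUs fails to transfer its data completely with probability q per time frame; the fraction of incomplete time frames is then roughly 1 - (1 - q)**500, which amplifies even small q:

for q in (1e-4, 1e-5, 1e-6, 1e-7):
    incomplete = 1.0 - (1.0 - q) ** 500
    print(f"per-CRU failure prob {q:.0e} -> incomplete time frames: {incomplete:.2%}")

For example, q = 1e-4 already makes about 5% of time frames incomplete, so the per-CRU probability has to be pinned down to well below that level.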
Figure 5 Preliminary data for the PCIe bus usage by the OS. The included plot shows a zoom into the tail of the probability
distribution of the time intervals during which the PCIe bus is used by the OS. These measurements were done by Filippo Costa and his
colleagues.
2. The full time frame building
The task of collecting the full time frame is presented in Figure 6. Each FLP unit sends its data buffer
associated with one time frame to the same EPN node. To keep up with the high input rate,
many such time-frame-building data transfers must be done in parallel (a minimal sketch of this dispatch
pattern is given after Figure 6).
Figure 6 A schematic view of the data flow to create complete time frames on the EPN nodes.
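The essential traffic pattern is that every FLP maps a given time frame to the same EPN, while successive time frames go to different EPNs so that many transfers run in parallel. The round-robin mapping and the node counts below are illustrative assumptions, not the actual control algorithm.

N_FLP = 250       # first-level processors (from the topology described below)
N_EPN = 1500      # event processing nodes (placeholder value)

def epn_for_timeframe(tf_id: int) -> int:
    """All FLPs map one time frame to one EPN; round-robin over time frames."""
    return tf_id % N_EPN

# Every FLP uses the same mapping, so the fragments of time frame 42
# all converge on a single destination EPN:
destinations = {flp: epn_for_timeframe(42) for flp in range(N_FLP)}
assert len(set(destinations.values())) == 1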
A generic topology for the full time frame building is presented in Figure 7. In this scheme there are ~250 FLP
processing units that collect sub-detector time frames and, using a control algorithm, send the data to
assemble the full time frame in one EPN node. We would like to consider two main options for the
location of the EPN processing farm:
- When it is at the same location as the FLP nodes. In this case, switching topologies in which FLP and EPN nodes are connected to leaf switching units are possible. Such possibilities are described in the O2 Technical Design Report 1. In this case the SEPNs (Super-EPN nodes) are not necessary, but they can be considered as an optimization to reduce the number of high-speed links and the buffer memory on each FLP node.
- When the FLP and EPN farms are physically separated. In this case we need to consider the SEPN units, which can be computing nodes or switching units. The data flow control mechanism will differ depending on whether the SEPNs are computing nodes or switching units. The data transport between the two locations will be Ethernet, and the number of wavelengths used should be adapted to keep up with the total data flow and to scale for estimated future needs. The network fabric that handles the fan-in part to collect all the data frames can use different network technologies (InfiniBand, Omni-Path, ...).
Figure 7. The generic view for the Time Frame building system.
1 https://aliceo2.web.cern.ch/system/files/ALICE_O2_TDR_2015_04_21.pdf
Input data and test-bed measurements necessary for the simulation:
- Probability distribution for the size of the data frames from the TPC detector
- Probability distribution for the size of the data frames from the other detectors. An example used in previous simulations is shown in Figure 8. Based on these values, the distribution of the size of the full time frames is shown in Figure 9 (a toy sketch of this construction is given after this list).
- Time frame rate: 50 kHz
- Delay distribution for messages (vs. size) for the case when the FLPs and EPNs are at the same location and for the case when they are at different locations (e.g. ~10 km, or 20-30 microseconds of latency)
- Transfer rate on the computing nodes and in the switches (on each port) in both directions, at a high sampling rate
- CPU usage (sys, int, usr) for different I/O technologies (Ethernet, InfiniBand, Omni-Path, ...)
- Memory-to-memory tests (localhost) for the transfer rate
- TCP parameters / tuning (congestion control, buffer size, frames, ...)
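A toy sketch of how the full time-frame size distribution (Figure 9) can be built from the per-FLP data-frame size distributions (Figure 8): sample one frame size per FLP and sum. The two-component Gaussian model and its parameters below are placeholders; the real study would sample from the measured distributions.

import random

N_TPC_FLP, N_OTHER_FLP = 170, 80      # FLP counts from the Figure 8 caption

def sample_frame_size(is_tpc: bool) -> float:
    # Placeholder per-FLP size model [MB]; replace with the measured
    # TPC / other-detector distributions in the actual simulation.
    return random.gauss(20.0, 2.0) if is_tpc else random.gauss(2.0, 0.5)

def sample_full_timeframe_size() -> float:
    tpc = sum(sample_frame_size(True) for _ in range(N_TPC_FLP))
    other = sum(sample_frame_size(False) for _ in range(N_OTHER_FLP))
    return tpc + other

sizes = [sample_full_timeframe_size() for _ in range(1000)]
print(f"mean full time-frame size: {sum(sizes) / len(sizes):.0f} MB (toy model)")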
Figure 8 The distribution of the data frame size. The two peaks correspond to the TPC detector (there are ~170 FLP units serving the
TPC) and the other detectors (~80 FLP units used for the other detectors).
Figure 9 The full time frame size distribution obtained by using the distributions from Figure 8
The simulation studies should help to:
- Correctly describe different switching technologies and the data flow in the entire system
- Optimize the design of the switching fabric and help in finding a cost-effective solution
- Correctly identify the scaling limits of such a system
- Help in designing the flow control protocol based on real-time monitoring data
3. The simulation of the O2 “offline” workflow
This simulation study aims to describe the global workflow for the offline processing and data storage
for the Run 3 period of the ALICE experiment. The workflow and the initial global infrastructure are described
in the ALICE O2 TDR (June 2015), Chapter 4 – Computing model.