
Development of the distributed computing system for
the MPD at the NICA collider, analytical estimations
Gertsenberger K. V.
Joint Institute for Nuclear Research, Dubna
Mathematical Modeling and Computational Physics 2013
NICA scheme

(figure: scheme of the NICA accelerator complex)
Multipurpose Detector (MPD)
The MPDRoot software is developed for the event simulation, reconstruction and physics analysis of the heavy-ion collisions registered by the MPD at the NICA collider.
Prerequisites of the NICA cluster
 high interaction rate (up to 6 kHz)
 high particle multiplicity: about 1000 charged particles for a central collision at the NICA energy
 reconstruction of one event currently takes tens of seconds in MPDRoot, so 1M events take months (see the estimate below this list)
 large data stream from the MPD:
100 000 events ~ 5 TB
100 000 000 events ~ 5 PB/year
 unified interface for parallel processing and storing of the event data
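A rough estimate behind these numbers (taking ~10 s per reconstructed event, the value also used in the analytical estimations at the end of the talk):

$$10^{6}\ \text{events} \times 10\ \text{s/event} = 10^{7}\ \text{s} \approx 116\ \text{days}, \qquad \frac{5\ \text{TB}}{10^{5}\ \text{events}} = 50\ \text{MB/event}$$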
Development of the NICA cluster
2 main lines of the development:
 data storage development for the experiment
 organization of parallel processing of the MPD events
⇒ development and expansion of a distributed cluster for the MPD experiment based on the LHEP farm
Current NICA cluster in LHEP for MPD

(figure: scheme of the current cluster)
Distributed file system GlusterFS
 aggregates the existing file systems into a common distributed file system
 automatic replication works as a background process
 a background self-checking service restores corrupted files in case of hardware or software failure
 implemented at the application layer, works in user space
Data storage on the NICA cluster

(figure: data storage scheme)
Development of the distributed computing system
NICA cluster: concurrent data processing on the cluster nodes
 PROOF server – parallel data processing in a ROOT macro on the parallel architectures
 MPD-scheduler – scheduling system for the task distribution to parallelize data processing on the cluster nodes
Parallel data processing with PROOF
 PROOF (Parallel ROOT Facility) – a part of the ROOT software, no additional installations required
 PROOF uses data-independent parallelism based on the lack of correlation between MPD events → good scalability
 Parallelization for three parallel architectures:
1. PROOF-Lite parallelizes the data processing on one multiprocessor/multicore machine
2. PROOF parallelizes the processing on a heterogeneous computing cluster
3. Parallel data processing in GRID
 Transparency: the same program code can execute both sequentially and concurrently (see the sketch below)
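A minimal sketch of this transparency (not the actual MPDRoot reconstruction code; the tree name "events" and the selector "MySelector" are placeholders): the same TChain::Process() call runs sequentially or through PROOF depending on one line of setup.

#include "TChain.h"
#include "TProof.h"

void run_transparent(Bool_t useProof = kFALSE)
{
   TChain chain("events");        // placeholder tree name
   chain.Add("evetest.root");

   if (useProof) {
      TProof::Open("lite://");    // start PROOF-Lite on the local cores
      chain.SetProof();           // route Process() through the PROOF session
   }

   // identical call in both modes: PROOF splits the entries among the
   // workers and merges the selector output lists automatically
   chain.Process("MySelector.C+");
}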
Using PROOF in MPDRoot
 The last parameter of the reconstruction macro: run_type (default: "local") – see the parsing sketch after the examples.
Speedup on a user multicore machine:
$ root 'reco.C("evetest.root", "mpddst.root", 0, 1000, "proof")'
parallel processing of 1000 events with the thread count equal to the logical processor count
$ root 'reco.C("evetest.root", "mpddst.root", 0, 500, "proof:workers=3")'
parallel processing of 500 events with 3 concurrent threads
Speedup on the NICA cluster:
$ root 'reco.C("evetest.root", "mpddst.root", 0, 1000, "proof:[email protected]:21001")'
parallel processing of 1000 events on all cluster nodes of the PoD farm
$ root 'reco.C("eve", "mpddst", 0, 500, "proof:[email protected]:21001:workers=10")'
parallel processing of 500 events on the PoD cluster with 10 workers
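A hypothetical sketch of how such a run_type string can be dispatched inside a ROOT macro (the actual reco.C parsing is not shown on these slides; handling of the "workers=N" suffix is omitted):

#include "TString.h"
#include "TProof.h"

void open_session(const TString &runType)
{
   if (!runType.BeginsWith("proof")) return;      // "local": sequential run

   TString master = "lite://";                    // no address given: PROOF-Lite
   Ssiz_t colon = runType.Index(":");
   if (colon != kNPOS) {
      TString tail = runType(colon + 1, runType.Length());
      if (tail.Contains("@"))                     // e.g. "[email protected]:21001"
         master = tail;                           // connect to the PoD master
      // a "workers=N" suffix would be parsed here (omitted for brevity)
   }

   TProof::Open(master);
}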
Speedup of the reconstruction on a 4-core machine

(figure: speedup plot)
PROOF on the NICA cluster
$ root 'reco.C("evetest.root", "mpddst.root", 0, 3, "proof:[email protected]:21001")'
(diagram: the fourth argument (3) is the event count; the macro connects to the master server of the Proof On Demand cluster at nc10.jinr.ru:21001; events №0, №1 and №2 of evetest.root are read from GlusterFS and distributed among the PROOF slave nodes with 8, 16, 24 and 32 logical processors; the resulting *.root files are merged into mpddst.root on GlusterFS)
Speedup of the reconstruction on the NICA cluster

(figure: speedup plot)
MPD-scheduler
 Developed in C++ with ROOT class support.
 Uses the Sun Grid Engine scheduling system (the qsub command) for execution in cluster mode.
 SGE combines the cluster machines of the LHEP farm into a pool of worker nodes with 78 logical processors.
 A job for distributed execution on the NICA cluster is described and passed to MPD-scheduler as an XML file:
$ mpd-scheduler my_job.xml
Job description
<job>
  <macro name="$VMCWORKDIR/macro/mpd/reco.C" start_event="0" count_event="1000" add_args="local"/>
  <file input="$VMCWORKDIR/macro/mpd/evetest1.root" output="$VMCWORKDIR/macro/mpd/mpddst1.root"/>
  <file input="$VMCWORKDIR/macro/mpd/evetest2.root" output="$VMCWORKDIR/macro/mpd/mpddst2.root"/>
  <file db_input="mpd.jinr.ru*,energy=3,gen=urqmd" output="~/mpdroot/macro/mpd/evetest_${counter}.root"/>
  <run mode="local" count="5" config="~/build/config.sh" logs="processing.log"/>
</job>
The description starts and ends with the <job> tag.
The <macro> tag sets information about the macro being executed by MPDRoot.
The <file> tags define the files to be processed by the macro above.
The <run> tag describes the run parameters and allocated resources.
* mpd.jinr.ru – the name of the server with the production database
Job execution on the NICA cluster

job_reco.xml:
<job>
  <macro name="~/mpdroot/macro/mpd/reco.C"/>
  <file input="$VMCWORKDIR/evetest1.root" output="$VMCWORKDIR/mpddst1.root"/>
  <file input="$VMCWORKDIR/evetest2.root" output="$VMCWORKDIR/mpddst2.root"/>
  <file input="$VMCWORKDIR/evetest3.root" output="$VMCWORKDIR/mpddst3.root"/>
  <run mode="global" config="~/mpdroot/build/config.sh"/>
</job>

job_command.xml:
<job>
  <command line="get_mpd_production energy=5-9"/>
  <run mode="global" count="3" config="~/mpdroot/build/config.sh"/>
</job>

(diagram: MPD-scheduler passes both jobs to the SGE batch system via qsub; free and busy SGE worker nodes with 8, 16, 24 and 32 logical processors read evetest[1-3].root from GlusterFS and write mpddst[1-3].root back)
Speedup of one reconstruction on the NICA cluster

(figure: speedup plot)
NICA cluster section on mpd.jinr.ru

(figure: screenshot of the web site section)
Conclusions
 The distributed NICA cluster (128 cores) was deployed on the basis of the LHEP farm for the NICA/MPD experiment (FairSoft, ROOT/PROOF, MPDRoot, GlusterFS, Torque, Maui).
 The data storage (10 TB) was organized with the distributed file system GlusterFS: /nica/mpd[1-8].
 A PROOF On Demand cluster was implemented to parallelize the event data processing for the MPD experiment; PROOF support was added to the reconstruction macro.
 The MPD-scheduler system for distributed job execution was developed to run MPDRoot macros concurrently on the cluster.
 The manuals for the systems described above are presented on the web site mpd.jinr.ru in the Computing – NICA cluster section.
Analytical model for parallel processing on a cluster

$$S_p(n) \;=\; \frac{\dfrac{n}{B_D} + T_1}{\dfrac{n}{B_D} + \dfrac{n}{B_D P_{node}} + \dfrac{T_1}{P_{node}}} \;=\; \frac{P_{node}\,(n + B_D T_1)}{n\,(P_{node} + 1) + B_D T_1}$$

speedup for a point (data-independent) algorithm of image processing

Pnode – count of logical processors, n – volume of data to process (MB), BD – speed of the data access (MB/s), T1 – "pure" time of the sequential processing (s)
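A small sketch evaluating this model numerically (the parameter values below are illustrative, not measured ones):

#include <cstdio>

// Speedup of a data-independent (point) algorithm:
// n - data volume (MB), BD - data access speed (MB/s),
// T1 - "pure" sequential processing time (s),
// Pnode - count of logical processors
double speedup(double n, double BD, double T1, double Pnode)
{
   return Pnode * (n + BD * T1) / (n * (Pnode + 1.0) + BD * T1);
}

int main()
{
   const double n = 5000.0, BD = 100.0, T1 = 10000.0;  // illustrative values
   for (double p : {4.0, 8.0, 16.0, 32.0, 64.0, 128.0})
      std::printf("Pnode = %3.0f  ->  Sp = %6.2f\n", p, speedup(n, BD, T1, p));
   return 0;
}

For Pnode → ∞ the speedup saturates at 1 + BD·T1/n: the data access speed, not the processor count, bounds the scalability.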
Prediction of the NICA computing power
How many logical processors are required to process NTASK physical analysis tasks and one reconstruction in parallel within Tday days?

$$P_{node} = \frac{n + B_D T_1}{B_D T_{par} - n}$$

$$P_{node}(N_{TASK}) = \frac{n_1 (N_{TASK} + 1)\, N_{EVENT} + B_D (T_{PA} N_{TASK} + T_{REC})\, N_{EVENT}}{B_D (T_{day} \cdot 24 \cdot 3600) - n_1 (N_{TASK} + 1)\, N_{EVENT}}$$

with n1 = 2 MB (event size), NEVENT = 10 000 000 events, TPA = 5 s/event, TREC = 10 s/event, BD = 100 MB/s, Tday = 30 days.
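Substituting these values (a worked instance of the formula above; Tday · 24 · 3600 = 2 592 000 s):

$$P_{node}(1) = \frac{2 \cdot 2 \cdot 10^{7} + 100 \cdot (5 + 10) \cdot 10^{7}}{100 \cdot 2\,592\,000 - 2 \cdot 2 \cdot 10^{7}} = \frac{1.504 \cdot 10^{10}}{2.192 \cdot 10^{8}} \approx 69$$

$$P_{node}(5) = \frac{2 \cdot 6 \cdot 10^{7} + 100 \cdot (25 + 10) \cdot 10^{7}}{2.592 \cdot 10^{8} - 1.2 \cdot 10^{8}} = \frac{3.512 \cdot 10^{10}}{1.392 \cdot 10^{8}} \approx 253$$

About 69 logical processors handle one reconstruction plus one analysis task within a month, and about 253 handle five analysis tasks; for NTASK ≥ 12 the denominator becomes negative, i.e. with these parameters the data access speed alone makes a 30-day turnaround impossible.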