Data production using CernVM and LxCloud
Dag Toppe Larsen
Warsaw, 2014-02-11
Outline
● CernVM/LxCloud data production
● Automatic data production
● Data production management
● Production database
● Web interface
CernVM cluster at LxCloud
● Requested and obtained new “NA61” project on final production LxCloud service
  ● Same quota (200 VCPUs/instances) as before
  ● Access controlled by new e-group “na61-cloud”
  ● Migration completed
● Software currently used:
  ● Legacy: 13e
  ● Shine: v0r5p0
● Software, databases & calibration data distributed via CvmFS
● Mass production of BeBe160 (11_040) to
Test production
● Recently, a new BeBe160 test production was submitted to CernVM running on LxCloud
  ● Job description file created by automatic data production manager
    – But manually submitted to CernVM cluster
● Output written to /castor/cern.ch/na61/prod/Be_Be_158_11/040_13e_v0r5p0_pp_cvm2_phys
● To be compared to /castor/cern.ch/na61/11/prod/13E040 (same legacy, Shine, global key, mode)
● For some reason, the legacy software does not enter the event loop (next slide)
  ● The Shine part of the processing appears to work OK, though
CernVM production error
● Should have got:
<StdUnmark:> Unmarking...
DSPACK 1.602, 1 Aug 2007
(dswrite, server: dag_28311_lxplus0099)
Staging dataset: bos:/afs/cern.ch/work/d/dag/test/run-014923x023.bos
DSPACK 1.602, 1 Aug 2007
(dsopen, server: dag_28311_lxplus0099)
Input file: /tmp/R.28582.fifo
Read definitions
DSPACK 1.602, 1 Aug 2007
(dsread, server: dag_28311_lxplus0099)
Read one event
________________________________________________________________________________
Run: 14923 Event: 1896087552
________________________________________________________________________________
● But got:
<StdUnmark:> Unmarking...
DSPACK 1.602, 1 Aug 2007
(dswrite, server: na61_31426_server-31edd847-e7c4-4a9a-a968-d52264f87fed)
Staging dataset: bos:/home/condor/execute/dir_31365/run-014923x028/run-014923x028.bos
DSPACK 1.602, 1 Aug 2007
(dsopen, server: na61_31426_server-31edd847-e7c4-4a9a-a968-d52264f87fed)
Input file: /tmp/R.31686.fifo
DS_OPEN_TOOL Error: No definition block
Finishing....
DSPACK 1.602, 1 Aug 2007
(dskill, server: na61_31426_server-31edd847-e7c4-4a9a-a968-d52264f87fed)
● What does “DS_OPEN_TOOL Error: No definition block” mean?
● Did not get this error when producing data on CernVM in the past
● If the exact same production script is run on LxPlus, using software from CvmFS (also mounted on LxPlus/batch), it works fine
● Some missing file (that is found on AFS in the case of LxPlus)?
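A check for this failure could be automated by scanning the job log for the two markers visible in the logs above. The following is a minimal sketch, not part of na61prod: the function name `legacy_log_status` and its return labels are invented for illustration, and it assumes the log is available as plain text.

```python
def legacy_log_status(log_text: str) -> str:
    """Classify a legacy-software log using markers seen in the two logs above."""
    lines = [line.strip() for line in log_text.splitlines()]
    # The failing log contains this DSPACK error before any event is read.
    if any(line.startswith("DS_OPEN_TOOL Error:") for line in lines):
        return "error"
    # The good log prints this line once the first event has been read.
    if "Read one event" in lines:
        return "event_loop_entered"
    return "unknown"


good = """Input file: /tmp/R.28582.fifo
Read definitions
Read one event"""
bad = """Input file: /tmp/R.31686.fifo
DS_OPEN_TOOL Error: No definition block
Finishing...."""

print(legacy_log_status(good))
print(legacy_log_status(bad))
```

Such a classifier only detects the symptom; it does not explain why the definition block is missing on CernVM.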
Automatic data production
Production DB
● Production DB has grown a bit beyond what was originally intended
  ● Difficult to work with the production information without a proper SQL database
  ● Tedious to access information from Castor and bookkeeping DB
  ● eLog data not always consistent (needed to be standardised)
    – eLog data needed as input for data production (magnetic field)
● Created an SQLite DB with three tables: runs, productions and chunkproductions
Production DB schema
● runs
  ● All information for given run
  ● Primary key: run
  ● Fields target, beam, momentum define reaction run belongs to
  ● Information imported from eLog via bookkeeping DB
runs table
● Contains all information for given run
● Fields beam, target, momentum & year define which reaction run belongs to
● Information imported from eLog via bookkeeping database
  ● All eLog information for all runs is imported
  ● eLog information is processed and stored in separate fields
    ● Including fields defining the reaction
    ● Original eLog entry also stored to allow later reprocessing
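The two-stage handling described above (store the original eLog entry, then parse it into separate fields so it can be reprocessed later) can be sketched with Python's `sqlite3` module. This is an illustration under stated assumptions, not the actual schema: the column name `elog_raw`, the helper names, the `name=value;...` entry format and the sample values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE runs (
        run      INTEGER PRIMARY KEY,  -- run number
        beam     TEXT,                 -- fields below are parsed from the eLog entry
        target   TEXT,
        momentum INTEGER,
        year     INTEGER,
        elog_raw TEXT                  -- assumed column name: original entry, kept for reprocessing
    )
    """
)


def elog_import(conn, run, raw_entry):
    """Stage 1: store only the raw eLog text for a run."""
    conn.execute("INSERT INTO runs (run, elog_raw) VALUES (?, ?)", (run, raw_entry))


def elog_convert(conn):
    """Stage 2: (re)parse every stored raw entry into the separate fields."""
    for run, raw in conn.execute("SELECT run, elog_raw FROM runs").fetchall():
        # Hypothetical entry format "name=value;name=value"; the real format is not shown here.
        fields = dict(item.split("=", 1) for item in raw.split(";") if "=" in item)
        conn.execute(
            "UPDATE runs SET beam = ?, target = ?, momentum = ?, year = ? WHERE run = ?",
            (
                fields.get("beam"),
                fields.get("target"),
                int(fields["momentum"]) if "momentum" in fields else None,
                int(fields["year"]) if "year" in fields else None,
                run,
            ),
        )


elog_import(conn, 14923, "beam=Be;target=Be;momentum=158;year=11")
elog_convert(conn)
row = conn.execute("SELECT beam, target, momentum, year FROM runs WHERE run = 14923").fetchone()
print(row)
```

Because the raw entry stays in the table, `elog_convert` can be rerun after the parsing rules change without importing from the bookkeeping database again.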
chunkproductions table
● Stores all chunks produced
● Associated to production, run and chunk
● Has potential to contain order of 10^6 rows
  ● By far largest table in DB
  ● Potential performance
● production: e.g. 1
● run: e.g. 123456
● chunk: e.g. 123
● rerun: number of times chunk has failed and been reprocessed
● status: waiting / processing / checking / ok / failed (numeric values)
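A table of this shape could look roughly as follows in SQLite. This is a sketch only: the numeric status codes, the composite primary key and the index are assumptions made for illustration (the slide says only that the status values are numeric), and the index is one common way to keep per-status queries fast on a table with around 10^6 rows.

```python
import sqlite3
from enum import IntEnum


class Status(IntEnum):
    # Assumed numbering; the real codes are not given on the slide.
    WAITING = 0
    PROCESSING = 1
    CHECKING = 2
    OK = 3
    FAILED = 4


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE chunkproductions (
        production INTEGER NOT NULL,
        run        INTEGER NOT NULL,
        chunk      INTEGER NOT NULL,
        rerun      INTEGER NOT NULL DEFAULT 0,  -- times failed and reprocessed
        status     INTEGER NOT NULL DEFAULT 0,  -- numeric status code
        PRIMARY KEY (production, run, chunk)
    );
    -- Lets "all failed chunks of production N" avoid a full table scan.
    CREATE INDEX idx_chunk_status ON chunkproductions (production, status);
    """
)

# Three example chunks of one run, all starting in the default "waiting" state.
conn.executemany(
    "INSERT INTO chunkproductions (production, run, chunk) VALUES (?, ?, ?)",
    [(1, 123456, chunk) for chunk in (1, 2, 3)],
)
conn.execute(
    "UPDATE chunkproductions SET status = ? WHERE production = 1 AND run = 123456 AND chunk = 2",
    (int(Status.FAILED),),
)

# Summarise the production by status code.
counts = dict(
    conn.execute(
        "SELECT status, COUNT(*) FROM chunkproductions WHERE production = 1 GROUP BY status"
    )
)
print(counts)
```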
productions table
● A unique combination of target, beam, momentum, year, key, legacy, shine, mode, os, source, type is a production
● Primary key: production
  ● Auto-generated unique number
● production: e.g. 1
● target: e.g. Be
● beam: e.g. Be
● momentum: e.g. 158
● year: e.g. 11
● key: e.g. 040
● legacy: e.g. 13c
● shine: e.g. v0r5p0
● mode: e.g. pp
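The rule that a unique combination of the eleven parameters defines one production, identified by an auto-generated number, maps naturally onto a `UNIQUE` constraint plus an auto-increment primary key. The sketch below assumes text columns throughout and a hypothetical helper `set_production`; the real table layout may differ.

```python
import sqlite3

FIELDS = ("target", "beam", "momentum", "year", "key",
          "legacy", "shine", "mode", "os", "source", "type")
COLUMNS = ", ".join(f'"{name}"' for name in FIELDS)  # quoted: "key" is an SQL keyword

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE productions ("
    "production INTEGER PRIMARY KEY AUTOINCREMENT, "
    + ", ".join(f'"{name}" TEXT NOT NULL' for name in FIELDS)
    + f", UNIQUE ({COLUMNS}))"
)


def set_production(conn, **params):
    """Return the number of the production with these parameters, creating it if needed."""
    values = tuple(str(params[name]) for name in FIELDS)
    where = " AND ".join(f'"{name}" = ?' for name in FIELDS)
    row = conn.execute(f"SELECT production FROM productions WHERE {where}", values).fetchone()
    if row is not None:
        return row[0]
    marks = ", ".join("?" for _ in FIELDS)
    cursor = conn.execute(f"INSERT INTO productions ({COLUMNS}) VALUES ({marks})", values)
    return cursor.lastrowid


params = dict(target="Be", beam="Be", momentum=158, year=11, key="040", legacy="13c",
              shine="v0r5p0", mode="pp", os="cvm2", source="phys", type="prod")
first = set_production(conn, **params)
again = set_production(conn, **params)                       # same combination, same number
other = set_production(conn, **{**params, "legacy": "13e"})  # one field differs, new number
print(first, again, other)
```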
Automated data production system commands
./na61prod
Usage:
./na61prod <command> <key=value>
<command> one of:
  elogImport    - import all elog information from bookkeeping
  elogConvert   - process elog information and fill database
  setProduction - register new production in database
  produce       - start new production
  check         - check, resubmit and update database for errors
  setRunOk      - mark runs as OK
<key=value> any of:
  runs          - list and/or range of runs       [all]
  type          - prod or test                    [prod]
  beam          - beam type                       No default value
  target        - target type                     No default value
  momentum      - beam momentum                   No default value
  year          - year of data taking             No default value
  key           - global key (no year)            [latest]
  legacy        - version of legacy software      [latest]
  shine         - version of Shine software       [latest]
  mode          - pp or pA                        [pp]
  os            - cvm2 or slc6                    [cvm2]
  source        - phys or sim                     [phys]
  path_in       - path to data (for sim. Data)    [root://castorpublic.cern.ch//castor/cern.ch/na61]
  comment       - free-text production comment    []
  ok            - 0 or 1                          [1]
<command> one of:
  setNameValue  - set possible value for key-value pair
<key=value> any of:
  name          - type, legacy, shine, mode, os, source, path_in, path_out or path_layout
  value         - value corresponding to name     []
  pref          - preferred value, 0 or 1         [1]
The system will choose [default] values for keys that are not set.
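The default-filling behaviour in the usage text can be sketched as a small argument parser. This is not the na61prod implementation: the function name `parse_args` is hypothetical, the defaults are copied from the usage text, and keys listed with "No default value" are simply left unset.

```python
# Defaults copied from the usage text; "latest" is kept as a literal placeholder.
DEFAULTS = {
    "runs": "all", "type": "prod", "key": "latest", "legacy": "latest",
    "shine": "latest", "mode": "pp", "os": "cvm2", "source": "phys",
    "path_in": "root://castorpublic.cern.ch//castor/cern.ch/na61",
    "comment": "", "ok": "1",
}
NO_DEFAULT = ("beam", "target", "momentum", "year")


def parse_args(argv):
    """Split ['command', 'key=value', ...] into a command and an options dict."""
    command, *pairs = argv
    options = dict(DEFAULTS)
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        if key not in DEFAULTS and key not in NO_DEFAULT:
            raise ValueError(f"unknown key {key!r}")
        options[key] = value  # an explicit value overrides the default
    return command, options


command, options = parse_args(
    ["setProduction", "beam=Be", "target=Be", "momentum=158", "year=11",
     "comment=New TPC calibration data."]
)
print(command, options["os"], options["beam"], options.get("path_out"))
```

Resolving `latest` to a concrete software version or global key would need a database lookup, which is left out here.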
Data production command usage
● na61prod command=elogImport runs=7000-18000
  ● Will obtain eLog information for all runs in this range
● na61prod command=elogConvert runs=all
  ● Process imported eLog information and fill relevant fields in runs table
● na61prod command=setProduction beam=Be target=Be momentum=158 year=11 comment="New TPC calibration data."
  ● Registers a new production in the productions table using default values
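The `runs` key accepts a list and/or range of runs. The exact syntax is not spelled out on the slides, so the sketch below assumes comma-separated items with hyphenated ranges; `parse_runs` is a hypothetical helper name.

```python
def parse_runs(spec):
    """Expand a run specification into a sorted list of run numbers.

    Assumed syntax: 'all', or comma-separated runs and inclusive ranges,
    e.g. '100,200-203'. Returns None for 'all' (meaning: no restriction).
    """
    if spec == "all":
        return None
    runs = set()
    for item in spec.split(","):
        first, sep, last = item.partition("-")
        if sep:
            low, high = int(first), int(last)
            if low > high:
                raise ValueError(f"empty range {item!r}")
            runs.update(range(low, high + 1))
        else:
            runs.add(int(first))
    return sorted(runs)


print(parse_runs("100,200-203"))
print(parse_runs("all"))
```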
Automatic data production manager status
● Can generate the files needed for submitting jobs (both LxBatch & CernVM)
● Now uses native SQLite language bindings for better performance
● Name-value pair table implemented to store allowed/default values for production parameters
● Part being worked on:
  ● Automatic submitting/checking/resubmitting of jobs
  ● Not “difficult”, but rather “tedious”
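The resubmission part of the planned checking step could be expressed as a single database update. This is a sketch under assumptions, not the manager's code: the status codes, the retry limit `MAX_RERUN` and the helper `resubmit_failed` are invented for illustration, and actual job submission is not shown.

```python
import sqlite3

WAITING, PROCESSING, CHECKING, OK, FAILED = range(5)  # assumed numeric codes
MAX_RERUN = 3                                         # assumed retry limit

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunkproductions ("
    "production INTEGER, run INTEGER, chunk INTEGER, "
    "rerun INTEGER DEFAULT 0, status INTEGER DEFAULT 0, "
    "PRIMARY KEY (production, run, chunk))"
)


def resubmit_failed(conn, production):
    """Requeue failed chunks that still have retries left; return how many were requeued."""
    cursor = conn.execute(
        "UPDATE chunkproductions SET status = ?, rerun = rerun + 1 "
        "WHERE production = ? AND status = ? AND rerun < ?",
        (WAITING, production, FAILED, MAX_RERUN),
    )
    conn.commit()
    return cursor.rowcount


conn.executemany(
    "INSERT INTO chunkproductions VALUES (?, ?, ?, ?, ?)",
    [
        (1, 123456, 1, 0, OK),      # finished, left alone
        (1, 123456, 2, 0, FAILED),  # first failure, requeued
        (1, 123456, 3, 3, FAILED),  # retry limit reached, stays failed
    ],
)
requeued = resubmit_failed(conn, 1)
rows = conn.execute(
    "SELECT chunk, rerun, status FROM chunkproductions ORDER BY chunk"
).fetchall()
print(requeued, rows)
```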
Web interface
Web interface
● Web interface to production DB
● Experimenting with best interface/usability for different use cases
● Currently can only display information
● http://cern.ch/na61cld/cgi-bin/prod
● Will add ability to log in for starting productions, etc.
● Working on script that will import information about already existing productions into database
● Can generate list of chunks from set of filtering criteria
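Generating a chunk list from filtering criteria amounts to building a `WHERE` clause from whichever criteria are given. The sketch below is not the web interface code: the reduced two-table schema, the sample rows and the helper `list_chunks` are made up for illustration. Filter names are checked against a whitelist and values are passed as bound parameters, which matters for anything reachable from a web form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE productions (production INTEGER PRIMARY KEY,
                              target TEXT, beam TEXT, momentum INTEGER, year INTEGER);
    CREATE TABLE chunkproductions (production INTEGER, run INTEGER,
                                   chunk INTEGER, status INTEGER);
    -- Made-up sample rows.
    INSERT INTO productions VALUES (1, 'Be', 'Be', 158, 11), (2, 'Be', 'Be', 40, 11);
    INSERT INTO chunkproductions VALUES (1, 14923, 23, 3), (1, 14923, 28, 4), (2, 15000, 1, 3);
    """
)

# Map each permitted filter name to a fixed column expression.
ALLOWED = {"target": "p.target", "beam": "p.beam", "momentum": "p.momentum",
           "year": "p.year", "run": "c.run", "status": "c.status"}


def list_chunks(conn, **criteria):
    """Return (production, run, chunk) tuples matching all given criteria."""
    clauses, values = [], []
    for name, value in criteria.items():
        if name not in ALLOWED:
            raise ValueError(f"cannot filter on {name!r}")
        clauses.append(f"{ALLOWED[name]} = ?")
        values.append(value)
    where = " AND ".join(clauses) or "1"  # no criteria: match every row
    sql = (
        "SELECT c.production, c.run, c.chunk FROM chunkproductions AS c "
        "JOIN productions AS p ON p.production = c.production "
        f"WHERE {where} ORDER BY c.production, c.run, c.chunk"
    )
    return conn.execute(sql, values).fetchall()


print(list_chunks(conn, momentum=158))
print(list_chunks(conn, momentum=158, status=3))
```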
General plan forward
● Complete the CernVM BeBe160 test production
● Finish the automatic submission/checking/resubmission of jobs for the automatic data production manager
● Add possibility to submit jobs from web interface
Proposal to migrate software, calibration data & databases to CvmFS
● CvmFS is based on the HTTP protocol
  ● Distributed globally via a hierarchy of cache servers
  ● Files are compressed on server side
  ● Downloaded on-demand, decompressed and semi-permanently cached on client side
    – A bit slow the first time a piece of software is run (to allow for software download), but at native speed on later runs
● Originally developed to distribute software to CernVM virtual machines
● Has gained popularity on conventional (non-virtualised) computing clusters as well
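The caching behaviour described above (compressed on the server, fetched on demand, decompressed and cached on the client) can be illustrated with a toy in-memory model. This is not CvmFS code and does not use its API: the classes `ToyServer` and `ToyClient` and the file path are invented, and a dictionary stands in for the HTTP transport.

```python
import zlib


class ToyServer:
    """Stands in for an HTTP server that stores files compressed."""

    def __init__(self, files):
        self.blobs = {path: zlib.compress(data) for path, data in files.items()}
        self.requests = 0  # counts downloads, to show the effect of caching

    def fetch(self, path):
        self.requests += 1
        return self.blobs[path]


class ToyClient:
    """Downloads a file on first access and serves later reads from a local cache."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def read(self, path):
        if path not in self.cache:
            # First access: download and decompress (the slow path).
            self.cache[path] = zlib.decompress(self.server.fetch(path))
        return self.cache[path]  # later accesses: local copy, no download


server = ToyServer({"/example/setup.sh": b"echo hello\n" * 100})
client = ToyClient(server)
first = client.read("/example/setup.sh")
second = client.read("/example/setup.sh")
print(server.requests, first == second)
```

Only the first read triggers a download, which mirrors why a program starts slowly once and then runs at native speed.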